Abstract

Structural ambiguity in linguistic analyses is a severe problem for natural language pro- 
cessing. For grammars describing a nontrivial fragment of natural language, every input of 
reasonable length may receive a large number of analyses, many of which are implausible or 
spurious. This problem is even harder for highly complex constraint-based grammars. Whereas 
the mathematical foundation of such grammars as instances of constraint logic programming 
is clear enough, there is so far no mathematically well-defined method for extending constraint 
logic programs by using weights to introduce graded distinctions between analyses. Previous
approaches to ambiguity resolution for context-dependent natural language processing models 
either are tailored to specific applications and based on uncertain mathematical grounds, or 
they are sufficiently well-defined and expressive but infeasible in practice. 

In this thesis, we present two approaches to a rigorous mathematical and algorithmic 
foundation of quantitative and statistical inference in constraint-based natural language pro- 
cessing. The first approach, called quantitative constraint logic programming, is conceptual- 
ized in a clear logical framework, and presents a sound and complete system of quantitative 
inference for definite clauses annotated with subjective weights. This approach combines a 
rigorous formal semantics for quantitative inference based on subjective weights with efficient 
weight-based pruning for constraint-based systems. The second approach, called probabilistic 
constraint logic programming, introduces a log-linear probability distribution on the proof 
trees of a constraint logic program and an algorithm for statistical inference of the param- 
eters and properties of such probability models from incomplete, i.e., unparsed data. The 
possibility of defining arbitrary properties of proof trees as properties of the log-linear prob- 
ability model and efficiently estimating appropriate parameter values for them permits the 
probabilistic modeling of arbitrary context-dependencies in constraint logic programs. The 
usefulness of these ideas is evaluated empirically in a small-scale experiment on finding the 
correct parses of a constraint-based grammar. In addition, we address the problem of com- 
putational intractability of the calculation of expectations in the inference task and present 
various techniques to approximately solve this task. Moreover, we present an approximate 
heuristic technique for searching for the most probable analysis in probabilistic constraint 
logic programs. 
Chapter 1



Introduction 



This thesis presents a novel mathematical treatment of the problem of structural ambiguity in 
constraint-based natural language processing (NLP). This problem will be attacked from two 
different angles. On the one hand, we present a novel formalism for quantitative
constraint-based inference with subjective weights. On the other hand, we approach this problem with
novel methods for statistical inference and probabilistic modeling for constraint-based NLP.

In this chapter we introduce the general problem of structural ambiguity and a general 
solution to this problem, namely weighted grammars. Furthermore, we will specify the no- 
tion of constraint-based NLP and sketch the general idea of the two different approaches 
to ambiguity resolution for weighted constraint-based grammars which constitute the main 
contribution of this thesis. 

1.1 Overview 

Following this introduction, Chap. 2 discusses the formal framework in which the informal
notion of constraint-based NLP will be dealt with in the course of this thesis. To this end, 
we discuss the formal basics of Constraint Logic Programming (CLP), which is used here 
to provide an operational treatment of various declarative constraint-based grammars. This 
is done by an embedding of the logical description languages of such grammars into a CLP 
scheme, yielding Constraint Logic Grammars (CLGs). 

Chap. 3 presents a quantitative extension of CLP which allows us to assign subjective
numerical weights to the structural components of a constraint logic program. We present a 
sound and complete system for quantitative inference with such subjective weights based on 
concepts of fuzzy set algebra. Furthermore, the general concepts of quantitative CLP will be 
exemplified with a simple quantitative CLG and we will show how the search technique of 
alpha-beta pruning can be adapted to efficiently finding the best parse in quantitative CLGs. 
A completely different approach to weighted CLP is presented in Chap. 4. Here, instead
of concentrating on a formal specification of the handling of subjective weights, the aim is 
to use methods of probabilistic modeling and statistical inference to automatically induce 
weights from empirical data. We introduce a powerful log-linear probability model for CLP 
and present a novel technique for statistical inference of the parameters and properties of 
such models from incomplete training data. We show monotonicity and convergence of the 
algorithm to the desired maximum likelihood estimates and discuss various methods for ap- 
proximate computation for the inference task. We present an instantiation of probabilistic 
CLP to a simple probabilistic CLG and show how the structure of the probabilistic model 
can be used to guide the search for the most probable analysis. Furthermore, the main concepts
of this statistical approach are evaluated empirically in a small experiment on finding
the correct parses of a constraint-based grammar. 

Chaps. 3 and 4, presenting the two different approaches to weighted CLP and CLGs, are
conceived completely independently of each other. Whereas Chap. 3 is based upon the
general concepts of Chap. 2, namely classical CLP with CLGs as a special instance, the work
of Chap. 4 is entirely self-contained and even more general. That is, the presented methods
of probabilistic modeling, statistical inference and approximate computation can easily be 
abstracted away from the CLP application to more general data structures. 

Chap. 5 presents a summary of the work of this thesis, and compares the advantages and
shortcomings of the two presented approaches relative to each other and relative to other 
approaches. Furthermore, directions of future work are sketched. 

The rest of this chapter motivates the why and how of the work of this
thesis.

1.2 A Practical Problem: Structural Ambiguity 

Structural ambiguity is a practical problem for every grammar describing a nontrivial frag- 
ment of natural language. That is, for such grammars every input of reasonable length may 
receive a large number of different analyses, many of which are not in accord with human 
perceptions. The problem to be addressed is how to differentiate between these analyses and
how to efficiently find the correct analysis out of the set of all possible ones. 

A simple example illustrating the ubiquity and severity of the problem of structural ambiguity
has been presented by Church and Patil (1982). Consider the following sentence with
two PPs. It has the following two analyses in terms of PP-attachment:

(1) a. Put the block [in the box on the table].
b. Put [the block in the box] on the table.
If we have three PPs, the number of analyses is five.

(2) a. Put the block [[in the box on the table] in the kitchen]. 

b. Put the block [in the box [on the table in the kitchen]]. 

c. Put [[the block in the box] on the table] in the kitchen. 

d. Put [the block [in the box on the table]] in the kitchen. 

e. Put [the block in the box] [on the table in the kitchen]. 

Continuing this list, more than a thousand analyses are reached with only eight PPs.
The pattern behind this list is a combinatorial growth of ambiguity in the number of PPs.
This growth pattern follows the Catalan numbers, where Cat(n) is the number of ways to
parenthesize a sentence of length n, or equivalently the number of binary trees that can
be constructed over n terminal elements.¹ Clearly, this pattern can also be found in other
linguistic combinations such as conjuncts, nominal modifications, or relative clauses.
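
The growth pattern can be checked numerically. The following Python sketch evaluates the closed form Cat(n) = C(2n, n) - C(2n, n-1), indexing n by the number of PPs as in examples (1) and (2). The function name `catalan` is our own choice for illustration and is not part of the thesis.

```python
from math import comb


def catalan(n: int) -> int:
    """Cat(n) = C(2n, n) - C(2n, n-1), defined here for n >= 1."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return comb(2 * n, n) - comb(2 * n, n - 1)


if __name__ == "__main__":
    # Number of PP-attachment analyses for 1 to 8 PPs.
    for pps in range(1, 9):
        print(pps, catalan(pps))
```

Under this indexing, two PPs give two analyses, three PPs give five, and eight PPs already exceed a thousand.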

Whereas ambiguities of this kind are only problematic if the number of linguistic elements 
to be combined is large, there is another source of ambiguity depending simply on the number 
of analyses the grammar can produce at all. Let us consider the standard linguistic example 
sentence John saw Mary and the two analyses given below. 

(3) a. [John_N [saw_V Mary_N]_VP]_S.
b. [[John_N saw_N]_NP Mary_N]_NP.



Even if the first analysis is perfectly plausible and might be considered as the unique analysis 
of this sentence, the second analysis has to be accepted if the grammar also licenses other 
nominal modifications such as 

(4) [[school_N committee_N]_NP meetings_N]_NP.



Following Abney (1996), the second analysis furthermore can be given a perfectly plausible
interpretation as the reference to a person named Mary who is associated with a kind of saw
called John saw. Clearly, such spurious ambiguities may be characterized as resulting from 
rare usages of words and constructions, but they will appear in every grammar which covers a 
reasonable fragment of natural language and thus produces a large number of analyses. Fur- 
thermore, in most cases such spurious ambiguities cannot be given a plausible interpretation, 
but just have to be accepted as a side-effect of high coverage. 



1 The Catalan numbers are generated by the following formula: $\mathrm{Cat}(n) = \binom{2n}{n} - \binom{2n}{n-1}$.
Together, combinatorial and spurious ambiguity can confront NLP systems with severe
problems. Clearly, there is a need to distinguish more plausible analyses of an input from less
plausible or even totally spurious ones. A practical and general approach to this problem is 
the use of weighted grammars for resolving structural ambiguities. 

1.3 A Practical Solution: Weighted Grammars 

We will approach the problem of structural ambiguity by using weighted grammars for am- 
biguity resolution. Weighted grammars can be characterized very generally as follows. They 
assign numerical values, called weights, to the structure-building components of a grammar 
and calculate the weight of an analysis from the weights of the structural components that 
make it up. The simple but effective assumption is to connect the plausibility of an analysis 
with its weight. That is, a ranking of analyses is defined by the weighted grammar, and more 
plausible analyses are differentiated from less plausible analyses in terms of their weights. The 
most plausible or correct analysis then is chosen from among the in-principle possible analyses 
by assuming the analysis with the greatest weight to be the correct one. Furthermore, when 
we are interested only in the highest weighted parse, the weight calculation scheme can be 
used to guide the search for the highest weighted parse efficiently instead of simply listing all 
possible parses and choosing the highest weighted one. 
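
The general recipe can be sketched in a few lines of Python. Everything concrete here is an assumption made for illustration: the rule names, the weight values, and in particular the choice of multiplication as the combination function, which the text deliberately leaves open at this point. The sketch also simply enumerates all analyses rather than using the weights to guide the search.

```python
from math import prod

# Hypothetical weights attached to structure-building components.
WEIGHTS = {
    "VP -> V NP PP": 0.6,
    "VP -> V NP": 0.4,
    "NP -> NP PP": 0.3,
    "NP -> Det N": 0.7,
}

# Two hypothetical analyses of one input, each listed by the rules it uses.
ANALYSES = {
    "attach PP to verb": ["VP -> V NP PP", "NP -> Det N", "NP -> Det N"],
    "attach PP to noun": ["VP -> V NP", "NP -> NP PP", "NP -> Det N", "NP -> Det N"],
}


def analysis_weight(rules, weights=WEIGHTS):
    """Combine component weights into one score (product is an assumption)."""
    return prod(weights[rule] for rule in rules)


def best_analysis(analyses, weights=WEIGHTS):
    """Return the name of the analysis with the greatest weight."""
    return max(analyses, key=lambda name: analysis_weight(analyses[name], weights))


if __name__ == "__main__":
    for name, rules in ANALYSES.items():
        print(name, analysis_weight(rules))
    print("chosen:", best_analysis(ANALYSES))
```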

There are three basic problems to be solved for every weighted grammar to be a useful 
device in real-world NLP applications. These problems can be described by the following
questions. 

1. How can the values of the weights be obtained? 

2. How should the weights be applied to the components of the grammar and how should 
the weight of an analysis be calculated from the weights of the components? 

3. How can the structure of the weight calculation scheme be used to guide the search for 
the highest weighted analysis efficiently? 

Clearly, the answers to these questions depend on each other and on the non-weighted 
framework to be extended. In the following we will sketch the basic ideas of two different 
approaches to answer these questions consistently for a framework of constraint-based systems. 
1.4 Towards a Mathematical Foundation of Weighted
Constraint-Based Grammars 

The NLP systems of choice in this thesis are constraint-based grammars. The term constraint- 
based is a collective name for highly expressive frameworks for declarative description of nat- 
ural language in terms of logical description languages. Throughout this thesis, the informal 
concept of constraint-based grammars will be replaced by the formal concept of constraint 
logic grammars. That is, constraint-based grammars are formalized here by an embedding 
of the logical description languages of such grammars into a CLP scheme, yielding CLGs 
as special applications of CLP. The advantages of this approach are on the one hand the 
(Turing-) power of the underlying logic, which is conceived as a welcome property to overcome 
the inadequacy of regular and context-free grammars for the description of natural language. 
On the other hand this approach permits an operational treatment of, e.g., the parsing prob- 
lem for arbitrary constraint-based grammars in a consistent and unique way. Since CLGs can 
be seen as special applications of CLP, the mathematical work of this thesis will be based 
upon CLP in general, and CLGs will serve as a running example illustrating the applicability
of the general work to NLP. The reference to the general framework of CLP will generalize 
the results of this thesis in a welcome manner. 

However, most CLP applications require some form of graded distinctions which are not 
provided by a classical CLP scheme. A very important example for this demand for gradedness 
is the task of structural ambiguity resolution in CLGs. A crucial assumption in this thesis 
is the claim that a framework of weighted CLP is the solution of choice for the ambiguity 
resolution problem for CLGs. In the following chapters we will present a rigorous mathematical 
formulation of two different approaches to weighted CLP and weighted CLGs. 

The first approach we will present is motivated by the aim to give the grammar designer 
and implementer maximal freedom in choosing appropriate values for the weights of the 
weighted grammar. That is, the values of the weights are only restricted to be some quantities 
lying in a certain interval of real numbers. Such weights can be restricted to meet the axioms 
of probability theory, but there is no need to do so. Besides subjective probabilities, such 
quantities could be subjective preference values, or values obtained from experiments on 
preferences in human language processing, or values describing human judgements on degrees 
of grammaticality, or others. In order to stress the generality of this approach to weighted
CLP and weighted CLGs, we will henceforth refer to it as quantitative CLP and quantitative 
CLGs, respectively. The main task of this approach is to specify the questions of how to 
establish a proper weight calculation scheme for given values and of how to use such a scheme 
for efficient disambiguation. Since it is the grammar designer and implementer who has to 
specify the grammar and the weights, it makes sense to tie these two tasks together as closely 
as possible. That means, in the same way as the inference system of classical CLP is coupled 
with a clear formal semantics, one would like to relate a quantitative inference system to a
quantitative formal semantics, instead of adding an extralogical calculation scheme to the 
well-defined logic of CLP. Thus the task to be addressed is to provide a precise, but yet 
simple formal semantics for quantitative inference in CLP. To this end, we present a formal 
semantics for quantitative CLP based upon the simple and intuitive concepts of fuzzy set 
algebra. This semantics and the corresponding sound and complete quantitative inference 
system furthermore are designed in a way which enables the search technique of alpha-beta 
pruning to be used quite directly for efficient disambiguation. Quantitative CLP then provides 
an efficient, well-defined quantitative deduction system, which can be adapted for specific 
applications by embedding specific constraint languages into CLP and attaching appropriate 
weights to them. 

A completely different approach to weighted CLP and weighted CLGs is presented by our 
models of probabilistic CLP and probabilistic CLGs. The aim of this approach is to specify 
a probability distribution over the set of proof trees of CLP or the parses of CLGs, and to 
provide statistical methods to infer the values of the parameters of such probabilistic models
from empirical data. For a given sample of training data and a parametric probability
model, both the parameters of the probabilistic model and the properties of the model associ- 
ated with these parameters can be induced automatically by methods of statistical inference. 
We present a highly expressive log-linear probability model for CLP, and a novel algorithm 
to infer the parameters and properties of log-linear models from incomplete data. We show 
monotonicity and convergence of the new algorithm and discuss methods for efficient approx- 
imate computation of the formulae involved in the algorithm. This algorithm is applicable to 
log-linear models in general, and especially provides the means for automatic and reusable 
training of arbitrary probabilistic constraint-based grammars from unparsed data. The useful- 
ness of these concepts is shown empirically in a small-scale experiment on finding preferences 
in parse-data from a constraint-based grammar. Furthermore, we discuss the possibilities of 
using the structure of the probabilistic model to guide the search for the most probable proof 
tree or analysis, and present a heuristic search algorithm for this task. Clearly, in this setting 
a model-theoretic semantics for probabilistic inference is superfluous since the values of the 
probabilistic parameters are obtained by automatic statistical methods which are not ma- 
nipulable by the user. Rather, we are interested in a stochastic semantics for CLP inference 
which is determined by the log-linear probability model together with the statistical methods 
for parameter estimation and property selection from given input data. 



1.5 Bibliographical Note 



Various parts of this thesis are based upon previously published work of the author. Chap. 3 is
an extended version of Riezler (1996). Chap. 4 is based upon work presented in Riezler (1997),
Chapter 2



Foundations: Basic Concepts of 
CLP and CLGs 



In this chapter we report the central formal concepts of the CLP scheme of Hohfeld and
Smolka (1988). In preparation for the following work we give some proofs missing in the
original paper and present the CLP scheme in a slightly modified fashion. Furthermore, in 
order to prepare the running example of the next chapters, we report the main concepts of a 
feature-based constraint language for HPSG and show how to embed this constraint language 
into the CLP scheme, yielding feature-based CLGs. 



2.1 Introduction and Overview 



Constraint logic programming is a powerful extension of conventional logic programming
(Lloyd 1987), and involves the incorporation of constraint languages and constraint solving
methods into logic programming languages. The name CLP was first introduced by Jaffar and
Lassez (1986) for a general framework of a logic programming language that is parametrized
with respect to a constraint language and a domain of computation, and yields soundness
and completeness results for an operational semantics relying on a constraint solver for the 
employed constraint language. For example, conventional logic programming or Prolog is ob- 
tained from CLP by employing equations between first order terms as constraint language 
and by interpreting these equations in the Herbrand universe. In this case the operational 
semantics of SLD-resolution can be seen to rely on a constraint solver which solves term 
equations in the Herbrand universe by term unification. Recent extensions, refinements, and 
various applications of CLP are discussed in Jaffar and Maher (1994). In the following we will
rely on the general CLP scheme of Hohfeld and Smolka (1988), which has been shown to be a
useful tool for our intended application of linguistic knowledge representation (see Dorre and
Dorna (1993), Gotz (1995), Gotz and Meurers (1995)).
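
To make the Prolog instance concrete, the following Python sketch shows syntactic term unification playing the role of the constraint solver for a term equation in the Herbrand universe. The term encoding is an assumption made only for this illustration: variables are strings starting with an uppercase letter, constants are other strings, and compound terms are tuples whose first element is the functor. The names `unify`, `walk` and `occurs` are hypothetical helpers, not taken from any of the cited systems.

```python
def is_var(term):
    """Variables are strings whose first character is uppercase."""
    return isinstance(term, str) and term[:1].isupper()


def walk(term, subst):
    """Follow variable bindings until an unbound variable or a non-variable term."""
    while is_var(term) and term in subst:
        term = subst[term]
    return term


def occurs(var, term, subst):
    """Occurs check: does var appear in term under the current bindings?"""
    term = walk(term, subst)
    if term == var:
        return True
    if isinstance(term, tuple):
        return any(occurs(var, arg, subst) for arg in term[1:])
    return False


def unify(s, t, subst=None):
    """Return a substitution solving the equation s = t, or None if unsatisfiable."""
    subst = {} if subst is None else dict(subst)
    stack = [(s, t)]
    while stack:
        a, b = stack.pop()
        a, b = walk(a, subst), walk(b, subst)
        if a == b:
            continue
        if is_var(a):
            if occurs(a, b, subst):
                return None
            subst[a] = b
        elif is_var(b):
            if occurs(b, a, subst):
                return None
            subst[b] = a
        elif (isinstance(a, tuple) and isinstance(b, tuple)
              and a[0] == b[0] and len(a) == len(b)):
            # Same functor and arity: the argument equations remain to be solved.
            stack.extend(zip(a[1:], b[1:]))
        else:
            return None
    return subst


if __name__ == "__main__":
    print(unify(("f", "X", "b"), ("f", "a", "Y")))
    print(unify(("f", "X"), ("g", "X")))
```

A returned substitution witnesses that the equation is satisfiable; `None` corresponds to an unsatisfiable constraint, at which point a resolution step would fail.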

The term constraint logic grammars expresses the connection between CLP and constraint-based
grammars. That is, CLGs are understood as grammars formulated by means of a suitable
logical language which can be used as a constraint language in the CLP scheme of Hohfeld
and Smolka (1988). The idea behind this connection is to provide an operational treatment
of purely declaratively specified grammars. This needs further explanation: Constraint-based
grammars enable a clear model-theoretic characterization of linguistic objects by specifying 
grammars as sets of descriptions from a suitable logical description language, called the con- 
straint language. The descriptions, called constraints, are stated as axioms required to be true 
of every object in the domain to be described, i.e., they constrain the admissible models of 
the grammar. The parsing problem (and similarly the generation problem) can be defined as 
follows: Given a set of axioms (encoding the grammar) and some constraint $\phi$ (encoding the
string/logical form we want to parse/generate from), we ask if there is some model of our
axioms which satisfies $\phi$. Following Gotz (to appear), we will call this the prediction problem.

A well-known subclass of these grammars widely used in computational linguistics are
grammars based upon feature description languages such as simple PATR grammars (Shieber
1986) or more expressive grammars such as LFG (Bresnan and Kaplan 1982) or HPSG (Pollard
and Sag 1994). Formalizations of the more or less informal notions of these grammars
in terms of first-order languages were first presented by Smolka (1988) for PATR and by
Johnson (1988) and King (1989), King (1994) for LFG and HPSG, respectively.

However, such model-theoretic approaches do not necessarily provide an operational in- 
terpretation of their declarative specifications. This may lead to problems with an operational 
treatment of model-theoretically well-defined problems such as parsing or generation. CLP 
provides one possible approach to an operational treatment of various such frameworks by 
embedding arbitrary logical languages into constraint logic programs. Definite clause speci- 
fications over such constraint languages then define grammars as constraint logic programs, 
i.e., as sets of axiomatic interpreted definite clauses. In this setting the prediction problem
reads as follows: Given a program $\mathcal{P}$ (encoding a grammar) and a definite goal $G$ (encoding the
string/logical form we want to parse/generate from), we ask if we can infer an answer $\varphi$ of $G$
(which is a satisfiable constraint encoding an analysis) proving the implication $\varphi \rightarrow G$ to be
a logical consequence of $\mathcal{P}$.



For feature-based grammars an embedding of a logical language close to that of Smolka
(1988) into the CLP scheme of Hohfeld and Smolka (1988) is done in the formalism CUF
(Dorre and Eisele 1991; Dorre and Dorna 1993). This approach quite directly offers the
operational properties of the CLP scheme, but unfortunately gives up the connection to the
model-theoretic specifications of the underlying feature-based grammars. A different approach
is given by Gotz (1995), Gotz and Meurers (1995), who define an explicit translation from
a logical language close to that of King (1994) into constraint logic programs. This translation
procedure preserves the prediction problem by generating a constraint logic program
$\mathcal{P}(\mathcal{G})$ from a feature-based grammar $\mathcal{G}$ in an explicit way. Other approaches to an operational
semantics for the prediction problem of feature-based languages have been presented,
e.g., by Carpenter (1992), Aït-Kaci, Podelski, and Goldstein (1993) or Gotz (to appear).
These approaches are tailored especially for specific feature-based languages and clearly suit
the particular frameworks better than an embedding of the specific languages into a CLP
scheme. However, under the CLP approach, arbitrary constraint-based grammars can receive
a unique operational semantics by an embedding into definite clause specifications.¹

We see the main advantage of the CLP approach in the possibility to rely on the well- 
understood paradigm of logic programming. This allows the resulting programs to run on 
existing architectures and to use well-known optimization techniques worked out in this area. 
The possibility to embed arbitrary constraint languages into the CLP scheme and the broad 
applicability of CLP itself should generalize the work of the following chapters in a welcome 
manner. 

This chapter is organized as follows. In Sect. 2.2 we will report the main concepts of constraint
logic programming following the CLP scheme of Hohfeld and Smolka (1988). As the
work in the next chapters will build upon this scheme, we will reformulate the main definitions
and propositions of Hohfeld and Smolka (1988) in a form convenient for the following
discussions, and give some missing proofs which will be helpful to make this work parallel to
the work of the next chapters.

In order to provide a concrete instantiation of this CLP scheme to constraint logic grammars,
we will report in Sect. 2.3 a feature-based constraint language and show how this
language can be embedded into the CLP scheme to yield feature-based CLGs.



2.2 Constraint Logic Programming 



The scheme presented by Hohfeld and Smolka (1988) generalizes conventional logic programming
(Lloyd 1987) and also the constraint logic programming scheme of Jaffar and Lassez
(1986) to a scheme of definite clause specifications over arbitrary constraint languages. Relying
on terminology well-known for conventional logic programming, Hohfeld and Smolka's
generalization of the key result of conventional logic programming can be stated as follows:
First, for every definite clause specification $\mathcal{P}$ in the extension of an arbitrary constraint
language $\mathcal{L}$, every interpretation of $\mathcal{L}$ can be extended to a minimal model of $\mathcal{P}$. Second, the
SLD-resolution method for conventional logic programming can be generalized to a sound
and complete operational semantics for definite clause specifications, which are not restricted
to Horn theories. In contrast to Jaffar and Lassez (1986), in this scheme constraint languages
are not required to be sublanguages of first order predicate logic and do not have to be
interpreted in a single fixed domain. Instead, a constraint is satisfiable if there is at least one
interpretation in which it has a solution. This makes this scheme usable for a wider range of
applications. Furthermore, such interpretations do not have to be solution compact.² Solution
compactness was necessary in Jaffar and Lassez (1986) to provide a sound and complete treatment
of negation as failure. Hohfeld and Smolka (1988) do not include negation as failure but
rather let the embedded constraint language provide for logical negation.

1 For example, an embedding of the logical language for tree-description grammars of Rogers (1994) into
the CLP scheme of Hohfeld and Smolka (1988) is given in Morawietz (1997).

2.2.1 Constraint Languages

A very general characterization of the concept of constraint language can be given as follows. 
Definition 2.1 ($\mathcal{L}$). A constraint language $\mathcal{L}$ consists of

• an $\mathcal{L}$-signature, specifying the non-logical elements of the alphabet of the language,

• a decidable infinite set VAR whose elements are called variables,

• a decidable set CON of $\mathcal{L}$-constraints which are pieces of syntax built from the
$\mathcal{L}$-signature, the variables in VAR, and the logical elements of the alphabet of the language,

• a computable function $\mathcal{V}$ assigning to every constraint $\phi \in \mathrm{CON}$ a finite set $\mathcal{V}(\phi)$ of
variables, the variables constrained by $\phi$,

• a nonempty set of $\mathcal{L}$-interpretations INT, where each $\mathcal{L}$-interpretation $\mathcal{I} \in \mathrm{INT}$ is
defined w.r.t. a nonempty set $\mathcal{D}$, the domain of $\mathcal{I}$, and a set ASS of variable assignments
$\alpha : \mathrm{VAR} \rightarrow \mathcal{D}$,

• a function $[\![\cdot]\!]^{\mathcal{I}}$ mapping every constraint $\phi \in \mathrm{CON}$ to a set $[\![\phi]\!]^{\mathcal{I}}$ of variable assignments,
the solutions of $\phi$ in $\mathcal{I}$.

• Furthermore, a constraint $\phi$ constrains only the variables in $\mathcal{V}(\phi)$, i.e., if $\alpha \in [\![\phi]\!]^{\mathcal{I}}$ and
$\beta$ is a variable assignment that agrees with $\alpha$ on $\mathcal{V}(\phi)$, then $\beta \in [\![\phi]\!]^{\mathcal{I}}$.
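
The components of this definition can be illustrated with a deliberately tiny constraint language. The following Python sketch is built on assumptions of our own: a single interpretation with a three-element domain, a finite stand-in for the infinite set VAR, and only two kinds of constraints (a variable equals a value, or two variables are equal). All identifiers are hypothetical. The last function checks the final condition of the definition by brute force.

```python
from itertools import product

DOMAIN = (0, 1, 2)        # hypothetical domain D of one interpretation
VARS = ("X", "Y", "Z")    # finite stand-in for the infinite set VAR


def constrained_vars(phi):
    """V(phi): the variables constrained by phi."""
    if phi[0] == "val":   # ("val", X, c) states X = c
        return {phi[1]}
    if phi[0] == "eq":    # ("eq", X, Y) states X = Y
        return {phi[1], phi[2]}
    raise ValueError(f"unknown constraint: {phi!r}")


def holds(phi, alpha):
    """Does the variable assignment alpha (a dict) solve phi?"""
    if phi[0] == "val":
        return alpha[phi[1]] == phi[2]
    if phi[0] == "eq":
        return alpha[phi[1]] == alpha[phi[2]]
    raise ValueError(f"unknown constraint: {phi!r}")


def assignments():
    """ASS: all variable assignments from VARS to DOMAIN."""
    return [dict(zip(VARS, values)) for values in product(DOMAIN, repeat=len(VARS))]


def solutions(phi):
    """The solutions of phi in the interpretation."""
    return [alpha for alpha in assignments() if holds(phi, alpha)]


def respects_constrained_vars(phi):
    """If alpha solves phi and beta agrees with alpha on V(phi), beta solves phi."""
    vs = constrained_vars(phi)
    return all(
        holds(phi, beta)
        for alpha in solutions(phi)
        for beta in assignments()
        if all(alpha[v] == beta[v] for v in vs)
    )


if __name__ == "__main__":
    phi = ("eq", "X", "Y")
    print(len(solutions(phi)), respects_constrained_vars(phi))
```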

In order to state certain closure conditions on constraint languages, further definitions are 
necessary. The following definitions are made with respect to some given constraint language. 

Definition 2.2. 

• A renaming is a bijection $\mathrm{VAR} \rightarrow \mathrm{VAR}$ that is the identity except for finitely many
exceptions.

• A constraint $\phi'$ is a $\rho$-variant of a constraint $\phi$ under a renaming $\rho$ iff $\phi' = \phi\rho$, i.e.,
$\phi'$ is the constraint obtained from $\phi$ by simultaneously replacing each occurrence of a
variable $X$ in $\phi$ by $\rho(X)$ for all variables $X$ in $\mathcal{V}(\phi)$, and so
$[\![\phi]\!]^{\mathcal{I}} = \{\alpha\rho \mid \alpha \in [\![\phi']\!]^{\mathcal{I}}\}$, i.e., the function compositions of the solutions of $\phi'$ and $\rho$ are
the solutions of $\phi$, for all interpretations $\mathcal{I}$.

• A constraint $\phi'$ is a variant of a constraint $\phi$ if there exists a renaming $\rho$ s.t. $\phi'$ is a
$\rho$-variant of $\phi$.

2 That is, it is not necessary that every element of an interpretation must be obtainable as the unique
solution of a possibly infinite set of constraints. See Jaffar and Lassez (1986).

The following closure conditions on constraint languages will be convenient in the further
discussion.

Definition 2.3. A constraint language is

• closed under renaming iff every constraint has a $\rho$-variant for every renaming $\rho$,

• closed under intersection iff for every two constraints $\phi$ and $\phi'$ there exists a constraint
$\psi$ s.t. $[\![\phi]\!]^{\mathcal{I}} \cap [\![\phi']\!]^{\mathcal{I}} = [\![\psi]\!]^{\mathcal{I}}$ for every interpretation $\mathcal{I}$,

• decidable iff the satisfiability of its constraints is decidable. A constraint $\phi$ is satisfiable
iff there exists at least one interpretation in which $\phi$ has a solution.

2.2.2 Relationally Extended Constraint Languages

To obtain constraint logic programs, a given constraint language $\mathcal{L}$ has to be extended to
a constraint language $\mathcal{R}(\mathcal{L})$ providing for the necessary relational atoms and propositional
connectives.

Definition 2.4 ($\mathcal{R}(\mathcal{L})$). A constraint language $\mathcal{R}(\mathcal{L})$ extending a constraint
language $\mathcal{L}$ is defined as follows:

• The signature of $\mathcal{R}(\mathcal{L})$ is an extension of the signature of $\mathcal{L}$ with a decidable set $\mathcal{R}$ of
relation symbols and an arity function $Ar : \mathcal{R} \rightarrow \mathbb{N}$.

• The variables of $\mathcal{R}(\mathcal{L})$ are the variables of $\mathcal{L}$.

• The set of $\mathcal{R}(\mathcal{L})$-constraints is the smallest set s.t.

1. $\phi$ is an $\mathcal{R}(\mathcal{L})$-constraint if $\phi$ is an $\mathcal{L}$-constraint,

2. $r(\vec{x})$ is an $\mathcal{R}(\mathcal{L})$-constraint, called an atom, if $r \in \mathcal{R}$ is a relation symbol with
arity $n$ and $\vec{x}$ is an $n$-tuple of pairwise distinct variables,

3. $\emptyset$, $F \,\&\, G$, $F \rightarrow G$ are $\mathcal{R}(\mathcal{L})$-constraints, if $F$ and $G$ are $\mathcal{R}(\mathcal{L})$-constraints,

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4- (f> & B± Sz . . . & B n — > A is an 1Z{C)- constraint, called a definite clause, if A, 
B u ... ,B n are atoms and <fi is an C- constraint. We may write a definite clause 
also as A <— & B± & . . . & B n . 

• The variables constrained by an 1Z(C) -constraint are defined as follows: If (p is an C - 
constraint, then V(</>) is defined as in C ; V(r(xi, . . . ,x n )) := {x±, . . . ,x n }; V(0) := 0; 
V(F & G) := V(F) U V(G); V(F -» G) := V(F) U V(G). 

• For eac/i £ -interpretation X , an 1Z(C) -interpretation A is an extension of an C-inter- 
pretation X with relations r- 4 on the domain V of A with appropriate arity for every 
r G 7Z, and the domain of A is the domain of X. 

• For each 1Z(C) -interpretation A , for each C -interpretation X , [-J- 4 is a function map- 
ping every 1Z{C) -constraint to a set of variable assignments s.t. 

1. 14>}^ = {(fif 1 if <f) is an C- constraint, 

2. {r{x)\ A = {a G ASS| a(x) G r A }, 

3. [01" 4 = ASS, 

I {F k G] A = {Fj A n {Gj A , 

5. {F — > Gj A = (ASS \ {Fj A ) U {Gj A . 

Note that we slightly abuse the notation a(x) to abbreviate the notation 
(a(xi), a(x2), ■ ■ ■ ,a(x n )) for a n-tuple of objects assigned to a n-tuple x of variables by a 
variable assignment a. 
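
The five denotation clauses of Definition 2.4 can be sketched as a small evaluator over a finite domain. This is only an illustration under stated assumptions: the two-variable, two-object setting, the tuple encoding of constraints, and the names `denote`, `relations`, and `x_is_a` are hypothetical and not part of the CLP scheme itself.

```python
from itertools import product

# Hypothetical finite setting: two variables over a two-element domain.
VARIABLES = ("X", "Y")
DOMAIN = ("a", "b")

# ASS: all total variable assignments, stored as hashable tuples of pairs.
ASS = frozenset(
    tuple(zip(VARIABLES, values))
    for values in product(DOMAIN, repeat=len(VARIABLES))
)

def denote(constraint, relations):
    """Return the set of assignments satisfying an R(L)-constraint.

    Constraints are nested tuples (an encoding assumed for this sketch):
      ("base", pred)    L-constraint, given as a predicate on assignments
      ("atom", r, xs)   relational atom r(xs)
      ("true",)         the empty conjunction
      ("and", F, G)     F & G
      ("imp", F, G)     F -> G
    """
    tag = constraint[0]
    if tag == "base":
        return frozenset(a for a in ASS if constraint[1](dict(a)))
    if tag == "atom":
        _, r, xs = constraint
        return frozenset(
            a for a in ASS if tuple(dict(a)[x] for x in xs) in relations[r]
        )
    if tag == "true":
        return ASS
    if tag == "and":
        return denote(constraint[1], relations) & denote(constraint[2], relations)
    if tag == "imp":
        return (ASS - denote(constraint[1], relations)) | denote(constraint[2], relations)
    raise ValueError(f"unknown constraint tag: {tag}")

# The clause (X = a) -> p(X) holds under every assignment when p contains a.
relations = {"p": {("a",)}}
x_is_a = ("base", lambda a: a["X"] == "a")
clause = ("imp", x_is_a, ("atom", "p", ("X",)))
holds_everywhere = denote(clause, relations) == ASS
```

In this sketch, a constraint whose denotation equals `ASS` is exactly one that the interpretation satisfies under every variable assignment.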

2.2.3 Syntax and Declarative Semantics of Definite Clause Specifications 

The concept of a constraint logic program now can be defined as a definite clause specification 
over a constraint language. 

Definition 2.5 (Definite clause specification). A definite clause specification $\mathcal{P}$ over a
constraint language $\mathcal{L}$ is a set of definite clauses from a constraint language $\mathcal{R}(\mathcal{L})$ extending
$\mathcal{L}$.

Models of definite clause specifications are determined by the definite clauses constituting
these specifications, i.e., a definite clause specification has its definite clauses as its axioms.
For reasons of generality, the following two definitions are made with respect to general sets
of $\mathcal{R}(\mathcal{L})$-constraints.

Definition 2.6 (Model). An $\mathcal{R}(\mathcal{L})$-interpretation $\mathcal{A}$ is a model of a set $\Phi$ of $\mathcal{R}(\mathcal{L})$-constraints
iff for every $\alpha \in \mathrm{ASS}$, for every $\psi \in \Phi$: $\alpha \in [\![\psi]\!]^{\mathcal{A}}$.

For convenience we furthermore introduce the concept of logical consequence.

Definition 2.7 (Logical consequence). An $\mathcal{R}(\mathcal{L})$-constraint $\psi$ is a logical consequence of
a set $\Phi$ of $\mathcal{R}(\mathcal{L})$-constraints iff, for every $\mathcal{R}(\mathcal{L})$-interpretation $\mathcal{A}$, $\mathcal{A}$ is a model of $\Phi$ implies
that $\mathcal{A}$ is a model of $\psi$.

A goal $G$ is defined as a possibly empty conjunction of $\mathcal{L}$-constraints and $\mathcal{R}(\mathcal{L})$-atoms.

Given a definite clause specification $\mathcal{P}$ and a goal $G$, a $\mathcal{P}$-answer of $G$ is defined as a
satisfiable $\mathcal{L}$-constraint $\phi$ such that the implication $\phi \to G$ is a logical consequence of $\mathcal{P}$.

In order to show that the semantic properties of conventional logic programming extend
to CLP, Höhfeld and Smolka (1988) first define a partial ordering on the set of
$\mathcal{R}(\mathcal{L})$-interpretations. $\mathcal{R}(\mathcal{L})$-interpretations extending the same $\mathcal{L}$-interpretation $\mathcal{I}$ are called base
equivalent, and $\mathcal{I}$ is called the base of these $\mathcal{R}(\mathcal{L})$-interpretations. A partial ordering on such
$\mathcal{R}(\mathcal{L})$-interpretations is defined via a partial ordering on the set of the denotations of the
relation symbols in these interpretations. We get for all base equivalent $\mathcal{R}(\mathcal{L})$-interpretations

• $\mathcal{A} \subseteq \mathcal{A}'$ iff for each $n$-ary relation symbol $r \in \mathcal{R}$: $r^{\mathcal{A}} \subseteq r^{\mathcal{A}'}$,

• $\mathcal{A} = \bigcup X$ iff for each $n$-ary relation symbol $r \in \mathcal{R}$: $r^{\mathcal{A}} = \bigcup \{r^{\mathcal{A}'} \mid \mathcal{A}' \in X\}$,

• $\mathcal{A} = \bigcap X$ iff for each $n$-ary relation symbol $r \in \mathcal{R}$: $r^{\mathcal{A}} = \bigcap \{r^{\mathcal{A}'} \mid \mathcal{A}' \in X\}$.

This set of base equivalent $\mathcal{R}(\mathcal{L})$-interpretations is a complete lattice under the partial order
of set inclusion. That is, for every set of base equivalent $\mathcal{R}(\mathcal{L})$-interpretations we have a
supremum, given by the union, and an infimum, given by the intersection of the interpretations
in the set. The top element is the $\mathcal{R}(\mathcal{L})$-interpretation $\mathcal{A}^{\top}$ such that for each $n$-ary relation
symbol $r \in \mathcal{R}$: $r^{\mathcal{A}^{\top}} = \mathcal{D}^{Ar(r)}$, and the bottom element is $\mathcal{A}^{\bot}$ s.t. for each $n$-ary relation
symbol $r \in \mathcal{R}$: $r^{\mathcal{A}^{\bot}} = \emptyset$.

Proposition 2.1, due to Höhfeld and Smolka (1988), generalizes the fixpoint- or lattice-theoretic
semantics of conventional logic programming to CLP. It says that for each
$\mathcal{L}$-interpretation $\mathcal{I}$, a definite clause specification $\mathcal{P}$ in $\mathcal{R}(\mathcal{L})$ defines unique minimal denotations
for the relation symbols of $\mathcal{R}$. That is, every $\mathcal{L}$-interpretation $\mathcal{I}$ can be used to construct a
minimal model for $\mathcal{P}$ in $\mathcal{R}(\mathcal{L})$. All questions concerning the declarative semantics of CLP
can then be dealt with in terms of a minimal model semantics. Moreover, a minimal model
semantics is crucial for the construction of a sound and complete deduction system for CLP.

Proposition 2.1 (Höhfeld and Smolka (1988), Theorem 4.4). Let $\mathcal{I}$ be an $\mathcal{L}$-interpretation
and $\mathcal{P}$ be a definite clause specification in $\mathcal{R}(\mathcal{L})$. Then the equations

$r^{\mathcal{A}_0} := \emptyset$,
$r^{\mathcal{A}_{i+1}} := \{\alpha(\vec{x}) \mid \text{there is a clause } (r(\vec{x}) \leftarrow G) \in \mathcal{P} \text{ and } \alpha \in [\![G]\!]^{\mathcal{A}_i}\}$

(i) define a chain $\mathcal{A}_0 \subseteq \mathcal{A}_1 \subseteq \ldots$ of $\mathcal{R}(\mathcal{L})$-interpretations extending $\mathcal{I}$,

(ii) the union $\mathcal{A} := \bigcup_{i \geq 0} \mathcal{A}_i$ is a model of $\mathcal{P}$ extending $\mathcal{I}$,

(iii) $\mathcal{A}$ is the minimal model of $\mathcal{P}$ extending $\mathcal{I}$.



Proposition 2.2 connects the concept of a $\mathcal{P}$-answer with the minimal model semantics of
$\mathcal{P}$ (see Höhfeld and Smolka (1988), Proposition 4.5). This proposition justifies the restriction
of the declarative semantics of CLP to a minimal model semantics. We prove this proposition
explicitly with reference to the concept of logical consequence.

Proposition 2.2. For each definite clause specification $\mathcal{P}$ in $\mathcal{R}(\mathcal{L})$, for each goal $G$, for each
$\mathcal{L}$-constraint $\phi$: $\phi \to G$ is a logical consequence of $\mathcal{P}$ iff each minimal model $\mathcal{A}$ of $\mathcal{P}$ is a model
of $\phi \to G$.

Proof. If: For each minimal model $\mathcal{A}$ of $\mathcal{P}$: $\mathcal{A}$ is a model of $\phi \to G$

$\Longrightarrow$ for every model $\mathcal{B}$ of $\mathcal{P}$ base equivalent to some minimal model $\mathcal{A}$ of $\mathcal{P}$: $\mathcal{B}$ is a model of
$\phi \to G$, since $\mathcal{A} \subseteq \mathcal{B}$ by Proposition 2.1

$\Longrightarrow$ $\phi \to G$ is a logical consequence of $\mathcal{P}$.

Only if: $\phi \to G$ is a logical consequence of $\mathcal{P}$

$\Longrightarrow$ every model of $\mathcal{P}$ is a model of $\phi \to G$, by Definition 2.7

$\Longrightarrow$ every minimal model $\mathcal{A}$ of $\mathcal{P}$ is a model of $\phi \to G$. □



2.2.4 Operational Semantics of Definite Clause Specifications 

The following definitions are made with respect to some implicit $\mathcal{L}$, $\mathcal{R}$, $\mathcal{P}$, and $V$, where $V$
denotes the finite set of variables in the query and the $V$-solutions of a constraint $\phi$ in an
interpretation $\mathcal{I}$ are defined as $[\![\phi]\!]^{\mathcal{I}}_{V} := \{\alpha|_{V} \mid \alpha \in [\![\phi]\!]^{\mathcal{I}}\}$, where $\alpha|_{V}$ is the restriction of $\alpha$ to $V$.

Höhfeld and Smolka (1988) define the generalization of the SLD-resolution rule by a binary
relation $\xrightarrow{r}$, called goal reduction, on the set of goals. The rule selects the leftmost atom in
the goal, looks for a variant of a program clause with the selected atom as head, and replaces
the selected atom in the goal by the body of the variant clause. Furthermore, the rule ensures
that no accidental variable sharing is introduced by the variant.

$A \;\&\; G \xrightarrow{r} F \;\&\; G$ if $A \leftarrow F$ is a variant of a clause in $\mathcal{P}$
s.t. $(V \cup \mathcal{V}(G)) \cap \mathcal{V}(F) \subseteq \mathcal{V}(A)$.

A second rule takes care of constraint solving for the $\mathcal{L}$-constraints appearing in subsequent
goals. The rule takes the conjunction of the $\mathcal{L}$-constraints from the reduced goal and the
applied clause and gives, via the black box of a suitable $\mathcal{L}$-constraint solver, a satisfiable
$\mathcal{L}$-constraint in solved form if the conjunction of $\mathcal{L}$-constraints is satisfiable. If the conjunction of
$\mathcal{L}$-constraints is not satisfiable, an $\mathcal{L}$-constraint $\bot$ denoting failure is returned. The constraint
solving rule can then be defined as a total function $\xrightarrow{c}$ on the set of goals.

$\phi \;\&\; \phi' \;\&\; G \xrightarrow{c} \phi'' \;\&\; G$ if $[\![\phi \;\&\; \phi']\!]^{\mathcal{I}}_{V \cup \mathcal{V}(G)} = [\![\phi'']\!]^{\mathcal{I}}_{V \cup \mathcal{V}(G)}$

for all $\mathcal{L}$-interpretations $\mathcal{I}$ and for all $\mathcal{L}$-constraints $\phi$, $\phi'$ and $\phi''$.

Furthermore, a complexity measure that mirrors the construction steps of a minimal model 
in the complexity of goal reduction is introduced. This measure will be crucial for proving 
completeness of goal reduction. 

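• The complexity of a variable assignment $\alpha$ for an atom $A$ in the minimal model $\mathcal{A}$ where
$\alpha \in [\![A]\!]^{\mathcal{A}}$ is defined as

$\mathrm{comp}(\alpha, A, \mathcal{A}) := \min\{i \mid \alpha \in [\![A]\!]^{\mathcal{A}_i}\}$;

• The complexity of $\alpha$ for goal $G$ in $\mathcal{A}$ where $\alpha \in [\![G]\!]^{\mathcal{A}}$ is

$\mathrm{comp}(\alpha, G, \mathcal{A}) := \{\mathrm{comp}(\alpha, A, \mathcal{A}) \mid A \text{ is an atom in } G\}$
where $\{\ldots\}$ is a multiset;

• The $V$-complexity of $\alpha$ for $G$ in $\mathcal{A}$ where $\alpha \in [\![G]\!]^{\mathcal{A}}_{V}$ is

$\mathrm{comp}_{V}(\alpha, G, \mathcal{A}) := \min\{\mathrm{comp}(\beta, G, \mathcal{A}) \mid \beta \in [\![G]\!]^{\mathcal{A}} \text{ and } \alpha = \beta|_{V}\}$

where $\beta|_{V}$ is the restriction of $\beta$ to the variables in $V$, and the minimum is taken with
respect to a total ordering on multisets such that $M < M'$ iff $\forall x \in M \setminus M'$, $\exists x' \in M' \setminus M$
s.t. $x < x'$.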
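
The multiset ordering used for the complexity measure can be sketched as follows. The function name `multiset_less` is hypothetical, and the sketch additionally assumes that the ordering is strict (two equal multisets are not related), which the condition above leaves implicit.

```python
from collections import Counter

def multiset_less(m, m_prime):
    """Return True iff m < m_prime in the multiset ordering.

    m < m_prime iff every element of M \\ M' is smaller than some
    element of M' \\ M, where \\ is multiset difference.
    """
    m, m_prime = Counter(m), Counter(m_prime)
    if m == m_prime:
        return False  # assumption: the ordering is strict
    only_m = m - m_prime        # multiset difference M \ M'
    only_m_prime = m_prime - m  # multiset difference M' \ M
    return all(
        any(x < y for y in only_m_prime.elements())
        for x in only_m.elements()
    )
```

For instance, replacing one atom of complexity 3 by several atoms of complexities 1, 1 and 2 yields a smaller multiset, which is the situation exploited when a goal reduction step replaces an atom by a clause body.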

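Höhfeld and Smolka (1988) prove the following propositions showing that goal reduction is
a sound and complete rule for deducing $\mathcal{P}$-answers from general definite clause specifications.
We prove the main results explicitly in Propositions 2.4 (soundness) and 2.6 (completeness).
Note that soundness and completeness can be proven without reference to constraint solving.

Proposition 2.3 (Höhfeld and Smolka (1988), Proposition 5.1). If $G_1 \xrightarrow{r} G_2$, then
$[\![G_2]\!]^{\mathcal{A}} \subseteq [\![G_1]\!]^{\mathcal{A}}$ for every model $\mathcal{A}$ of $\mathcal{P}$.

Proposition 2.4. If $G \xrightarrow{r}{}^{*} \phi$, then $\phi \to G$ is a logical consequence of $\mathcal{P}$.

Proof. $G \xrightarrow{r}{}^{*} \phi$

$\Longrightarrow$ $[\![\phi]\!]^{\mathcal{A}} \subseteq [\![G]\!]^{\mathcal{A}}$ for every model $\mathcal{A}$ of $\mathcal{P}$, by Proposition 2.3 and transitivity of $\subseteq$

$\Longrightarrow$ for every model $\mathcal{A}$ of $\mathcal{P}$: $[\![\phi \to G]\!]^{\mathcal{A}} = \mathrm{ASS}$, since for every model $\mathcal{A}$ of $\mathcal{P}$: $[\![\phi]\!]^{\mathcal{A}} \subseteq [\![G]\!]^{\mathcal{A}}$

$\Longrightarrow$ for every model $\mathcal{A}$ of $\mathcal{P}$: $\mathcal{A}$ is a model of $\phi \to G$, by Definition 2.6

$\Longrightarrow$ $\phi \to G$ is a logical consequence of $\mathcal{P}$. □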



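Proposition 2.5 (Höhfeld and Smolka (1988), Theorem 5.2). Let $\mathcal{L}$ be closed under
renaming, $\mathcal{A}$ be a minimal model of $\mathcal{P}$, $G_1$ be a goal, $A$ be an atom in $G_1$, and $\alpha \in [\![G_1]\!]^{\mathcal{A}}_{V}$.
Then there exists a clause $C$ in $\mathcal{P}$ and a goal $G_2$ s.t. $G_1 \xrightarrow{r} G_2$ using a variant of $C$ on $A$ is
possible, $\alpha \in [\![G_2]\!]^{\mathcal{A}}_{V}$ and $\mathrm{comp}_{V}(\alpha, G_2, \mathcal{A}) < \mathrm{comp}_{V}(\alpha, G_1, \mathcal{A})$.

Proposition 2.6 (Höhfeld and Smolka (1988), Corollary 5.3). Let $\mathcal{L}$ be closed under
renaming, $\mathcal{A}$ be a minimal model of $\mathcal{P}$, $G$ be a goal and $\alpha \in [\![G]\!]^{\mathcal{A}}_{V}$. Then there exists a
$\mathcal{P}$-answer $\phi$ of $G$ s.t. $G \xrightarrow{r}{}^{*} \phi$ and $\alpha \in [\![\phi]\!]^{\mathcal{A}}_{V}$.

Proof. The result is proven by induction on $\mathrm{comp}_{V}(\alpha, G, \mathcal{A})$.

Base: Goals with multiset complexity $\emptyset$ have to be a satisfiable $\mathcal{L}$-constraint $\phi$. Then $\phi \xrightarrow{r}{}^{*} \phi$
and $\phi$ is a $\mathcal{P}$-answer of itself.

Hypothesis: Suppose the result holds for goals with multiset complexity less than some
multiset $N$.

Step: $\mathrm{comp}_{V}(\alpha', G_1, \mathcal{A}) = N$ and $\alpha' \in [\![G_1]\!]^{\mathcal{A}}_{V}$

$\Longrightarrow$ there exists a clause $C$ of $\mathcal{P}$ and a goal $G_2$ s.t. $G_1 \xrightarrow{r} G_2$ and $\alpha' \in [\![G_2]\!]^{\mathcal{A}}_{V}$ and
$\mathrm{comp}_{V}(\alpha', G_2, \mathcal{A}) < \mathrm{comp}_{V}(\alpha', G_1, \mathcal{A})$, by Proposition 2.5

$\Longrightarrow$ there exists a $\mathcal{P}$-answer $\phi$ of $G_2$ s.t. $G_2 \xrightarrow{r}{}^{*} \phi$ and $\alpha' \in [\![\phi]\!]^{\mathcal{A}}_{V}$, by the hypothesis

$\Longrightarrow$ there exists a $\mathcal{P}$-answer $\phi$ of $G_1$ s.t. $G_1 \xrightarrow{r}{}^{*} \phi$ and $\alpha' \in [\![\phi]\!]^{\mathcal{A}}_{V}$, and by Proposition 2.4,
$\phi \to G_1$ is a logical consequence of $\mathcal{P}$.

The result follows by arithmetic induction. □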

In all following examples, we will use a standard Prolog resolution procedure for the CLP
scheme of Höhfeld and Smolka (1988), i.e., we combine the left-right selection rule defined
in goal reduction with a depth-first search rule. Furthermore, after each goal reduction step,
constraint solving is applied, and another clause is tried immediately if constraint solving
fails. Moreover, it will be convenient in the following discussion to view the search space
determined by the derivation rules $\xrightarrow{r}$ and $\xrightarrow{c}$ as a search of a tree. A derivation tree is
defined as follows.
Definition 2.8 (Derivation tree). A derivation tree determined by a query $G_1$ and a definite
clause specification $\mathcal{P}$ has to satisfy the following conditions:

1. Each node is either a relation node or a constraint node.

2. The successors of every relation node are all constraint nodes s.t. for every $\xrightarrow{r}$-resolvent
$G'$ obtainable by a clause $C$ from goal $G$ in a relation node, there is a successor constraint
node labeled by $C$ and $G'$.

3. The successors of every constraint node are all relation nodes s.t. for the unique
$\xrightarrow{c}$-resolvent $G \;\&\; \phi''$ obtainable from goal $G \;\&\; \phi \;\&\; \phi'$ in a constraint node, there is a successor
relation node labeled by $G \;\&\; \phi''$.

4. The root node is a relation node labeled by $G_1$.

5. A success node is a terminal relation node labeled by a satisfiable $\mathcal{L}$-constraint.

Successful derivations correspond to subtrees of derivation trees which are labeled by
terminal success nodes. Such trees can be defined as proof trees as follows.

Definition 2.9 (Proof tree). A proof tree for a query $G_1$ from $\mathcal{P}$ is a subtree of a derivation
tree determined by $G_1$ and $\mathcal{P}$ and is defined as follows:

1. A relation node of the proof tree is a relation node of the supertree and takes one of the
successors of the relation node of the supertree as its successor node.

2. A constraint node of the proof tree is a constraint node of the supertree and takes the
unique successor of the constraint node of the supertree as its successor node.

3. The root node of the proof tree is the root node of the supertree.

4. The terminal node of the proof tree is a success node of the supertree, labeled by a
satisfiable $\mathcal{L}$-constraint, called answer constraint.

Let us illustrate the basic concepts of CLP with an example. A simple program consisting
of clauses 1 to 3 is depicted in Fig. 2.1.

1 $q(X) \leftarrow p(X)$.

2 $p(X) \leftarrow X = a$.

3 $p(X) \leftarrow X = b$.

Figure 2.1: Constraint logic program

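The $\mathcal{L}$-constraints are considered to come from a language of hierarchical types, where
the ordering on types is defined via the operation of set inclusion on their denotations. In our
example, we have $[\![a]\!]^{\mathcal{I}} \subseteq [\![e]\!]^{\mathcal{I}}$, $[\![b]\!]^{\mathcal{I}} \subseteq [\![e]\!]^{\mathcal{I}}$ and $[\![a]\!]^{\mathcal{I}} \cap [\![b]\!]^{\mathcal{I}} = \emptyset$. This hierarchy is depicted
graphically in Fig. 2.2.

Figure 2.2: Type hierarchy

The construction of a minimal model for the program of Fig. 2.1 is shown in Fig. 2.3. The
unique minimal denotations of the relation symbols $p$ and $q$ are obtained in step 1 and 2 of
the minimal model construction respectively.

$p^{\mathcal{A}_0} = \emptyset$, $q^{\mathcal{A}_0} = \emptyset$,

$p^{\mathcal{A}_1} = \{[\![a]\!]^{\mathcal{I}}, [\![b]\!]^{\mathcal{I}}\}$, $q^{\mathcal{A}_1} = \emptyset$,

$p^{\mathcal{A}_2} = \{[\![a]\!]^{\mathcal{I}}, [\![b]\!]^{\mathcal{I}}\}$, $q^{\mathcal{A}_2} = \{[\![a]\!]^{\mathcal{I}}, [\![b]\!]^{\mathcal{I}}\}$,

$p^{\mathcal{A}} = \{[\![a]\!]^{\mathcal{I}}, [\![b]\!]^{\mathcal{I}}\}$, $q^{\mathcal{A}} = \{[\![a]\!]^{\mathcal{I}}, [\![b]\!]^{\mathcal{I}}\}$, where $\mathcal{A} = \bigcup_{i \geq 0} \mathcal{A}_i$

Figure 2.3: Minimal model construction for constraint logic program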
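
The construction of Fig. 2.3 can be reproduced with a short fixpoint iteration. This is a minimal sketch under simplifying assumptions: the base interpretation contains one hypothetical object per minimal type (`obj_a`, `obj_b`), every clause is unary over a single variable $X$, and the names `PROGRAM`, `step`, and `minimal_model` are illustrative only.

```python
# Hypothetical base interpretation: one object per minimal type,
# with e as the top type covering both objects.
TYPE_DENOTATION = {"a": {"obj_a"}, "b": {"obj_b"}, "e": {"obj_a", "obj_b"}}
ALL_OBJECTS = TYPE_DENOTATION["e"]

# Clauses of Fig. 2.1 as (head, body atoms, type constraint on X or None).
PROGRAM = [
    ("q", ["p"], None),  # 1  q(X) <- p(X).
    ("p", [], "a"),      # 2  p(X) <- X = a.
    ("p", [], "b"),      # 3  p(X) <- X = b.
]

def step(relations):
    """Compute the denotations of step i+1 from those of step i."""
    new = {r: set() for r in relations}
    for head, body, type_constraint in PROGRAM:
        solutions = set(ALL_OBJECTS)
        if type_constraint is not None:
            solutions &= TYPE_DENOTATION[type_constraint]
        for atom in body:
            solutions &= relations[atom]
        new[head] |= solutions
    return new

def minimal_model():
    """Iterate from empty denotations until nothing changes."""
    relations = {"p": set(), "q": set()}
    chain = [relations]
    while True:
        nxt = step(relations)
        if nxt == relations:
            return relations, chain
        chain.append(nxt)
        relations = nxt
```

Calling `minimal_model()` returns the final denotations together with the chain of intermediate interpretations, in which `p` is filled in the first step and `q` in the second, mirroring the figure.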



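$q(X) \;\&\; X = e$

$\xrightarrow{r}$ 1, $p(X) \;\&\; X = e$

$\xrightarrow{c}$ $p(X) \;\&\; X = e$, with two branches:

• $\xrightarrow{r}$ 2, $X = e \;\&\; X = a$ $\xrightarrow{c}$ $X = a$

• $\xrightarrow{r}$ 3, $X = e \;\&\; X = b$ $\xrightarrow{c}$ $X = b$

Figure 2.4: Derivation tree for constraint logic program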



A derivation tree for the query $q(X) \;\&\; X = e$ from the program of Fig. 2.1 is given in Fig.
2.4. We depict only the success branches of the derivation tree, yielding two distinct proof
trees for the query, with answer constraints $X = a$ and $X = b$ respectively.
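
The two proof trees can also be enumerated mechanically. The following sketch combines goal reduction with constraint solving in a depth-first search; it assumes that all clauses share the single variable $X$, that the constraint solver reduces to taking greatest lower bounds in the type hierarchy of Fig. 2.2, and that a clause without an explicit constraint carries the trivial constraint $X = e$. The names `meet`, `CLAUSES`, and `proofs` are hypothetical.

```python
def meet(s, t):
    """Greatest lower bound of two types; None stands for failure."""
    if s == t or t == "e":
        return s
    if s == "e":
        return t
    return None  # a and b have disjoint denotations

# Fig. 2.1 as clause number -> (head, body atoms, type constraint on X).
CLAUSES = {
    1: ("q", ["p"], "e"),
    2: ("p", [], "a"),
    3: ("p", [], "b"),
}

def proofs(atoms, constraint, trace=()):
    """Yield (answer type, clause numbers used) for each proof tree."""
    if not atoms:
        yield constraint, trace  # success node: only a constraint is left
        return
    selected, rest = atoms[0], atoms[1:]  # leftmost selection rule
    for number, (head, body, clause_type) in CLAUSES.items():
        if head != selected:
            continue
        solved = meet(constraint, clause_type)  # constraint solving step
        if solved is None:
            continue  # failure: try the next clause immediately
        yield from proofs(body + rest, solved, trace + (number,))

# The query q(X) & X = e.
answers = list(proofs(["q"], "e"))
```

Each answer pairs the solved type constraint on $X$ with the sequence of clause numbers labeling the constraint nodes of the corresponding proof tree.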

Soundness of the CLP scheme implies that corresponding to the derivation of $X = a$ and
$X = b$, we know that the implications $X = a \to q(X) \;\&\; X = e$ and $X = b \to q(X) \;\&\; X = e$
are logical consequences of the program of Fig. 2.1. This is easily verified from the minimal
model given in Fig. 2.3. Furthermore, completeness of the CLP scheme is easily verified from
the fact that for solutions $\alpha \in [\![q(X) \;\&\; X = e]\!]^{\mathcal{A}}_{V}$ and $\alpha' \in [\![q(X) \;\&\; X = e]\!]^{\mathcal{A}}_{V}$, we can derive
$\mathcal{P}$-answers with $\alpha \in [\![X = a]\!]^{\mathcal{A}}_{V}$ and $\alpha' \in [\![X = b]\!]^{\mathcal{A}}_{V}$.



2.3 Constraint Logic Grammars 



In this section, we will explicate the concept of constraint logic grammars. To this end, we will 
restrict our attention to feature-based CLGs and discuss in particular the main properties of 
an HPSG instance of such grammars. 

We will show how a feature-based constraint language can be obtained from a feature-based
logical description language, and how such a constraint language can be embedded into
the CLP scheme of Höhfeld and Smolka (1988), yielding a feature-based CLG. The language
to be discussed is that of Götz (1995), Götz (to appear), which is close to that of King (1989),
King (1994) (modulo the usage of variables) and Smolka (1988), Smolka (1992) (modulo
appropriateness conditions). This language provides a description language $\mathcal{FD}$ specifying
the logical foundations of HPSG grammars and is extendable to a constraint language $\mathcal{FL}$,
in the sense of Höhfeld and Smolka (1988). The expressive power of the language is smaller
than or equal to the expressive power of first-order predicate logic with equality.



2.3.1 A Feature-Based Constraint Language 

The language is based on a notion of signature, i.e., the non-logical elements of the alphabet, 
declaring the structures the linguist is interested in. A signature specifies a set of feature 
symbols, a lattice of sort symbols and appropriateness conditions restricting the functional 
properties of the feature symbols. All subsequent work should be understood with respect to 
an implicit signature X. 

Definition 2.10 (Signature). A signature is a quadruple $\langle \mathcal{T}, \preceq, \mathcal{F}, \mathit{approp} \rangle$ s.t.

• $\langle \mathcal{T}, \preceq \rangle$ is a finite join-semilattice of types,

• $\mathcal{S} = \{t \in \mathcal{T} \mid \text{if } t' \preceq t \text{ then } t' = t\}$ is a finite set of minimal types,

• $\mathcal{F}$ is a finite set of feature symbols,

• $\mathit{approp} : \mathcal{S} \times \mathcal{F} \rightharpoonup \mathcal{T}$ is a partial function from pairs of minimal types and features to
types.

The well-formed formulae of the feature-based description language $\mathcal{FD}$, called feature
descriptions, are built from the symbols in the signature, a countably infinite set of variables
VAR, the symbol : assigning features to their values, and the standard boolean connectives.
Expressions of this kind can be seen as the formal equivalent of the AVM notation used in
Pollard and Sag (1994). The set Desc of feature descriptions is defined as follows.

Definition 2.11 (Feature descriptions). The set Desc of feature descriptions is the smallest
set s.t.

• $X$ is a description if $X \in \mathrm{VAR}$,

• $t$ is a description if $t \in \mathcal{T}$,

• $f{:}D$ is a description if $f \in \mathcal{F}$, $D \in \mathrm{Desc}$,

• $D_1 \wedge D_2$, $D_1 \vee D_2$, $\neg D_1$, $D_1 \to D_2$ are descriptions if $D_1 \in \mathrm{Desc}$, $D_2 \in \mathrm{Desc}$.



An interpretation of a signature is based on an arbitrary domain of objects, and assigns to
every object exactly one minimal type, and to every feature symbol a partial function on the
domain. The domains and ranges of these functions are determined by the $\mathit{approp}$ function.
This function specifies that for each object $u$ of a minimal type $s$, there is a connected object
$\mathsf{F}(f)(u)$ defined iff $\mathit{approp}(s, f)$ is defined, and the type $\mathsf{S}(\mathsf{F}(f)(u))$ of this connected object
has to be appropriate.

Definition 2.12 (Interpretation). An interpretation is a triple $\mathcal{I} = \langle \mathsf{U}, \mathsf{S}, \mathsf{F} \rangle$ s.t.

• $\mathsf{U}$ is a set of objects, the domain of $\mathcal{I}$,

• $\mathsf{S} : \mathsf{U} \to \mathcal{S}$ is a total function from the domain to the set of minimal types,

• $\mathsf{F}$ is a total feature interpretation function assigning to every $f \in \mathcal{F}$ a partial function
$\mathsf{F}(f)$ on $\mathsf{U}$ s.t.

1. for each $u \in \mathsf{U}$, for each $f \in \mathcal{F}$, if $\mathit{approp}(\mathsf{S}(u), f)$ is defined and
$\mathit{approp}(\mathsf{S}(u), f) = t$, then $\mathsf{F}(f)(u)$ is defined and $\mathsf{S}(\mathsf{F}(f)(u)) \preceq t$,

2. for each $u \in \mathsf{U}$, for each $f \in \mathcal{F}$, if $\mathsf{F}(f)(u)$ is defined, then
$\mathit{approp}(\mathsf{S}(u), f)$ is defined and $\mathsf{S}(\mathsf{F}(f)(u)) \preceq \mathit{approp}(\mathsf{S}(u), f)$.

The denotation of feature descriptions with respect to an interpretation I and a variable 
assignment a is defined to be a subset of the domain for every feature description. By ab- 
stracting away from the variable assignment, we arrive at a concept of abstract denotation 
comprising the denotation of a feature description under every possible variable assignment. 

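Definition 2.13 (Variable assignment). A variable assignment $\alpha : \mathrm{VAR} \to \mathsf{U}$ is a total
function from the set of variables to the domain. Write ASS for the set of variable assignments.

Definition 2.14 (Feature description denotation).

$[\![X]\!]^{\mathcal{I}}_{\alpha} = \{\alpha(X)\}$ if $X \in \mathrm{VAR}$,

$[\![t]\!]^{\mathcal{I}}_{\alpha} = \{u \in \mathsf{U} \mid \mathsf{S}(u) \preceq t\}$ if $t \in \mathcal{T}$,

$[\![f{:}D]\!]^{\mathcal{I}}_{\alpha} = \{u \in \mathsf{U} \mid \mathsf{F}(f)(u) \text{ is defined}, \mathsf{F}(f)(u) \in [\![D]\!]^{\mathcal{I}}_{\alpha}\}$ if $f \in \mathcal{F}$, $D \in \mathrm{Desc}$,

$[\![D_1 \wedge D_2]\!]^{\mathcal{I}}_{\alpha} = [\![D_1]\!]^{\mathcal{I}}_{\alpha} \cap [\![D_2]\!]^{\mathcal{I}}_{\alpha}$ if $D_1, D_2 \in \mathrm{Desc}$,

$[\![D_1 \vee D_2]\!]^{\mathcal{I}}_{\alpha} = [\![D_1]\!]^{\mathcal{I}}_{\alpha} \cup [\![D_2]\!]^{\mathcal{I}}_{\alpha}$ if $D_1, D_2 \in \mathrm{Desc}$,

$[\![\neg D_1]\!]^{\mathcal{I}}_{\alpha} = \mathsf{U} \setminus [\![D_1]\!]^{\mathcal{I}}_{\alpha}$ if $D_1 \in \mathrm{Desc}$,

$[\![D_1 \to D_2]\!]^{\mathcal{I}}_{\alpha} = (\mathsf{U} \setminus [\![D_1]\!]^{\mathcal{I}}_{\alpha}) \cup [\![D_2]\!]^{\mathcal{I}}_{\alpha}$ if $D_1, D_2 \in \mathrm{Desc}$.

Definition 2.15 (Abstract denotation).

$[\![D]\!]^{\mathcal{I}} = \bigcup_{\alpha \in \mathrm{ASS}} [\![D]\!]^{\mathcal{I}}_{\alpha}$ if $D \in \mathrm{Desc}$.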
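
Definitions 2.14 and 2.15 translate directly into a recursive evaluator over a finite interpretation. The sketch below is illustrative only: the three-object interpretation, the table `SUBTYPES` (mapping each type to the minimal types below it), the tuple encoding of descriptions, and the function names are all assumptions made for this example.

```python
from itertools import product

# Hypothetical finite interpretation (U, S, F).
U = {"w1", "n1", "sg1"}
S = {"w1": "word", "n1": "n", "sg1": "sg"}        # minimal type of each object
SUBTYPES = {                                       # type -> minimal types below it
    "word": {"word"}, "n": {"n"}, "sg": {"sg"},
    "cat": {"n"}, "agr": {"sg"},
}
F = {"CAT": {"w1": "n1"}, "AGR": {"w1": "sg1"}}    # partial functions on U

def denote(d, alpha):
    """Denotation of description d under variable assignment alpha."""
    tag = d[0]
    if tag == "var":
        return {alpha[d[1]]}
    if tag == "type":
        return {u for u in U if S[u] in SUBTYPES[d[1]]}
    if tag == "feat":
        _, f, sub = d
        values = denote(sub, alpha)
        return {u for u in U if u in F[f] and F[f][u] in values}
    if tag == "and":
        return denote(d[1], alpha) & denote(d[2], alpha)
    if tag == "or":
        return denote(d[1], alpha) | denote(d[2], alpha)
    if tag == "not":
        return U - denote(d[1], alpha)
    if tag == "imp":
        return (U - denote(d[1], alpha)) | denote(d[2], alpha)
    raise ValueError(f"unknown description tag: {tag}")

def abstract_denote(d, variables):
    """Union of the denotations of d over all assignments to `variables`."""
    result = set()
    for values in product(sorted(U), repeat=len(variables)):
        result |= denote(d, dict(zip(variables, values)))
    return result
```

Under these assumptions, the description `word ∧ CAT : n` denotes `{"w1"}`, and `AGR : Y` has a non-empty abstract denotation because some assignment maps `Y` to the agreement value of `w1`.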

To obtain a feature-based constraint language $\mathcal{FL}$ fulfilling the closure requirements on
constraint languages stated by Höhfeld and Smolka (1988), first we simply have to attach to every
feature description $D$ in $\mathcal{FD}$ a new variable not occurring in the set $\mathcal{V}(D)$ of variables in $D$.
This avoids accidental variable sharing and guarantees renaming closure of $\mathcal{FL}$. Furthermore,
an explicit definition of conjunction of feature constraints ensures intersection closure of $\mathcal{FL}$.

Definition 2.16 (Feature constraints).

• $X = D$ is a constraint if $X \in \mathrm{VAR}$, $X \notin \mathcal{V}(D)$, $D \in \mathrm{Desc}$,

• $\phi \;\&\; \phi'$ is a constraint if $\phi$, $\phi'$ are constraints.

The denotation of a constraint is defined by a function mapping every constraint to a
set of variable assignments, called solutions. The solutions of a constraint $X = D$ are the
variable assignments in ASS which constrain the value of the variable $X$ to the objects in
the denotation of $D$. The denotation of a conjunction of constraints is the intersection of the
respective denotations.

Definition 2.17 (Feature constraint solutions).

• $[\![X = D]\!]^{\mathcal{I}} = \{\alpha \in \mathrm{ASS} \mid \alpha(X) \in [\![D]\!]^{\mathcal{I}}_{\alpha}\}$ if $X \in \mathrm{VAR}$, $X \notin \mathcal{V}(D)$, $D \in \mathrm{Desc}$,

• $[\![\phi \;\&\; \phi']\!]^{\mathcal{I}} = [\![\phi]\!]^{\mathcal{I}} \cap [\![\phi']\!]^{\mathcal{I}}$ if $\phi$, $\phi'$ are constraints.

Next we have to consider the problem of deciding satisfiability of feature descriptions and 
feature constraints. 

Definition 2.18 (Satisfiability of feature descriptions). A feature description $D$ is satisfiable
iff there is an interpretation $\mathcal{I}$ s.t. $[\![D]\!]^{\mathcal{I}} \neq \emptyset$.

This problem has been shown to be decidable for feature-based description languages
closely related to the above reported one. For the description language reported above, a
decision algorithm is given by Götz (to appear), for the variable-free notational variant of
King (1994) by Kepser (1994), for a less expressive version of the language not employing
appropriateness conditions by Smolka (1988), Smolka (1992), or for an even less expressive
version employing conjunction as only boolean operator by Aït-Kaci, Podelski, and Goldstein
(1993).



Most of these approaches adapt for satisfiability checking a constraint solving method
similar to that of Smolka (1988), Smolka (1992). This method is a three-step transformation
process from feature descriptions to a solved form of feature constraints displaying
(un)satisfiability. Following Götz (to appear), constraint solving for the feature-based constraint
language reported above can be illustrated as follows: Firstly, every feature description
is transformed to disjunctive normal form; secondly, every feature description in disjunctive
normal form is transformed into a (disjunctively interpreted) set of (conjunctively interpreted)
sets of feature constraints of the simple form $X = Y$, $X = \neg Y$, $X = t$ or $X = f : Y$; thirdly,
every such set of sets of simple feature constraints is transformed into a set of sets of feature
constraints in solved form.

For reasons of readability, we will consider the constraint solver for the feature-based
constraint language $\mathcal{FL}$ in the following as a black box. The interested reader is referred for
details and proofs to Götz (to appear). In all subsequent examples, we will depict only the
result of constraint solving, re-translated from simple feature constraints in solved normal
form to feature constraints in a more readable form according to Definition 2.16.
The notion of satisfiability defined for feature constraints is as follows.

Definition 2.19 (Satisfiability of feature constraints). A feature constraint $\phi$ is satisfiable
iff there exists an interpretation $\mathcal{I}$ s.t. $[\![\phi]\!]^{\mathcal{I}} \neq \emptyset$.

Since every feature constraint is satisfiable whenever the embedded feature description is
satisfiable, and since satisfiability of feature descriptions is decidable, we get immediately the
desired decidability result for the feature-based constraint language $\mathcal{FL}$. To sum up, since
$\mathcal{FL}$ is closed under renaming and intersection, and due to the decision algorithm for $\mathcal{FL}$
constraint solving of Götz (to appear), we can state the following proposition.

Proposition 2.7. $\mathcal{FL}$ is a decidable constraint language closed under renaming and intersection.



2.3.2 Feature-Based Constraint Logic Grammars 



Feature-based grammars can be built in a purely declarative way simply as sets of axiomatically
interpreted feature descriptions from the feature description language $\mathcal{FD}$.

Definition 2.20 (Grammar). A feature-based grammar $\mathcal{G}$ is a finite set of feature descriptions
s.t. $\mathcal{G} \subseteq \mathrm{Desc}$.

The feature descriptions comprising a grammar constrain the admissible models of the
grammar in that in every model of a grammar every feature description must be true of every
object.

Definition 2.21 (Model). A model of a feature-based grammar $\mathcal{G}$ is an interpretation
$\mathcal{I} = \langle \mathsf{U}, \mathsf{S}, \mathsf{F} \rangle$ s.t. for every $u \in \mathsf{U}$, for every $D \in \mathcal{G}$: $u \in [\![D]\!]^{\mathcal{I}}$.

The central problem of prediction can then be defined model-theoretically as a relation
between grammars and feature descriptions encoding the questioned input.

Definition 2.22 (Prediction). A feature description $D$ is predicted by a grammar $\mathcal{G}$ iff
there is a model $\mathcal{I}$ of $\mathcal{G}$ s.t. $[\![D]\!]^{\mathcal{I}} \neq \emptyset$.

In contrast to this definition, the linguistic problem of grammaticality is sometimes considered
as a relation between grammars and objects. As we will see below, the syntactic
coding of Def. 2.22 enables a connection of the model-theoretic concept of prediction with
the implementational parsing/generation problem. The problem of prediction has been shown to
be undecidable for various feature-based description languages (see Aït-Kaci, Podelski, and
Goldstein (1993), Smolka (1992), Götz (to appear)). As shown by Götz (to appear), decidable
fragments of such languages are obtainable, e.g., in the form of grammars fulfilling the finite
model property.

Definition 2.23. A grammar $\mathcal{G}$ has the finite model property iff for all descriptions $D$,
$\mathcal{G}$ predicts $D$ iff $\mathcal{G}$ has a finite model $\mathcal{I}$ s.t. $[\![D]\!]^{\mathcal{I}} \neq \emptyset$.

Note that even if the prediction problem is decidable for grammars having the finite model
property, it is undecidable whether a grammar has the finite model property or not. Thus it has
to be kept in mind that decidability of the prediction problem for linguistically interesting
CLGs is based on an assumption of finiteness of linguistic structures.

To obtain feature-based CLGs from feature-based grammars, the feature descriptions from
$\mathcal{FD}$ have to be extended to feature constraints from $\mathcal{FL}$, which then can be embedded as
$\mathcal{FL}$-constraints into a suitable definite clause specification in $\mathcal{R}(\mathcal{FL})$. An example for such an
embedding of a feature-based grammar into the CLP scheme of Höhfeld and Smolka (1988)
is given below. The resulting feature-based CLG can be seen as a notational variant of a
CUF-grammar (Dörre and Eisele 1991; Dörre and Dorna 1993). Alternatively, when replacing
the predicates of this feature-based CLG by a single predicate gram in all clauses, we arrive
at a program which would result from a direct application of the compilation algorithm of
Götz (1995), Götz and Meurers (1995) to a feature-based grammar.$^{3}$ This compilation scheme
connects the model-theoretic concept of prediction with the logic programming concept of
$\mathcal{P}$-answer directly. This is done by an automatic generation of an $\mathcal{R}(\mathcal{FL})$-program $\mathcal{P}$ for every
$\mathcal{FD}$-grammar $\mathcal{G}$, where the program defines a unary relation gram encoding prediction. This
encoding is said to be correct under the following conditions.

Definition 2.24. Let $\mathcal{P}$ be a definite clause specification in $\mathcal{FL}$ defining the relation gram,
and let $\mathcal{G}$ be a grammar from $\mathcal{FD}$. Then $\mathcal{P}$ is a correct translation of $\mathcal{G}$ iff

$\mathcal{G}$ predicts feature description $D$ iff the goal $\mathtt{gram}(X) \;\&\; X = D$ has a $\mathcal{P}$-answer.

The compilation scheme presented by Götz (1995) is sound and for a large class of grammars
complete. A sufficient condition to receive correct translations in the sense of Def. 2.24
is again the finite model property. Thus under the assumption that linguistic structures are
finite, CLP can be seen as a useful parsing scheme for linguistically interesting feature-based
CLGs.

Let us illustrate these concepts with an example. Suppose a simple grammar licensing,
among others, analyses such as

[Peter believes [Clinton$_{N}$ talks$_{V}$]$_{S}$]$_{S}$

or

[Peter believes [Clinton$_{N}$ talks$_{N}$]$_{NP}$]$_{S}$.

We will define now a feature-based grammar presenting an $\mathcal{FD}$-encoding of the part of this
grammar which is relevant for the structural ambiguity. It is a modified and extended version
of an example from Carpenter (1992).

The signature comes with a type hierarchy with top element $\top$, feature symbols, and
appropriateness conditions, and is depicted in the graph in Fig. 2.5. Feature symbols are depicted
in SMALL CAPS font, type symbols in lower case italics, and appropriateness conditions
are expressed in a matrix notation, reading, e.g., $\mathit{approp}(\mathit{phrase}, \mathrm{DTR1}) = \mathit{sign}$.

The relevant $\mathcal{FD}$-descriptions are given in Fig. 2.6. The first implication encodes the rules
S $\to$ N V and NP $\to$ N N. Context-sensitivity is introduced by the agreement requirement
on the first rule. The second implication encodes the rules N $\to$ Clinton, V $\to$ talks, and
N $\to$ talks.



$^{3}$ Based on a differentiation of types into distinct sets according to whether and how they appear as antecedents
of grammar constraints, this compilation procedure introduces a set of clauses defining the single predicate gram
for each such set of types. Actually, for the example given below, this compilation scheme would also produce
a clause $\mathtt{gram}(X) \leftarrow X = t$ for each minimal type $t$ of the grammar signature which is not the antecedent of
a grammar description. For ease of readability, we will omit clauses introduced for (minimal or non-minimal)
non-antecedent types in our example.
[Type hierarchy graph; recoverable node labels and appropriateness matrices:]

phrase: DTR1 sign, DTR2 sign, CAT cat, AGR agr

word: PHON basexpr, CAT cat, AGR agr

Further type labels: np, n, basexpr, Clinton, talks, sg, pl

Figure 2.5: Signature for feature-based grammar

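$\mathit{phrase} \to (\mathrm{CAT} : s \wedge \mathrm{DTR1{:}CAT} : n \wedge \mathrm{DTR2{:}CAT} : v \wedge \mathrm{DTR1{:}AGR} : Y \wedge \mathrm{DTR2{:}AGR} : Y)$
$\qquad \vee\ (\mathrm{CAT} : np \wedge \mathrm{DTR1{:}CAT} : n \wedge \mathrm{DTR2{:}CAT} : n)$

$\mathit{word} \to (\mathrm{CAT} : n \wedge \mathrm{PHON} : \mathit{Clinton} \wedge \mathrm{AGR} : sg)$
$\qquad \vee\ (\mathrm{CAT} : v \wedge \mathrm{PHON} : \mathit{talks} \wedge \mathrm{AGR} : sg)$
$\qquad \vee\ (\mathrm{CAT} : n \wedge \mathrm{PHON} : \mathit{talks} \wedge \mathrm{AGR} : pl)$

Figure 2.6: Feature-based grammar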



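The CLG obtained from a simplified compilation of the grammar in Fig. 2.6 to a definite
clause specification in $\mathcal{FL}$ is given in Fig. 2.7. The embedded $\mathcal{FL}$-constraints are depicted
graphically in the same way as $\mathcal{FD}$-descriptions. $\mathcal{R}(\mathcal{FL})$-atoms are depicted in typewriter
font.

Given this program and a goal

$X = (\mathit{sign} \wedge \mathrm{DTR1{:}PHON} : \mathit{Clinton} \wedge \mathrm{DTR2{:}PHON} : \mathit{talks}) \;\&\; \mathtt{sign}(X)$

encoding the phrase Clinton talks, we can infer two answers

$X = (\mathit{phrase} \wedge \mathrm{CAT} : s \wedge \mathrm{DTR1} : \mathit{word} \wedge \mathrm{DTR1{:}CAT} : n$
$\quad \wedge\ \mathrm{DTR1{:}PHON} : \mathit{Clinton} \wedge \mathrm{DTR1{:}AGR} : Y \wedge \mathrm{DTR1{:}AGR} : sg$
$\quad \wedge\ \mathrm{DTR2} : \mathit{word} \wedge \mathrm{DTR2{:}CAT} : v \wedge \mathrm{DTR2{:}PHON} : \mathit{talks}$
$\quad \wedge\ \mathrm{DTR2{:}AGR} : Y \wedge \mathrm{DTR2{:}AGR} : sg)$

and

$X = (\mathit{phrase} \wedge \mathrm{CAT} : np \wedge \mathrm{DTR1} : \mathit{word} \wedge \mathrm{DTR1{:}CAT} : n$
$\quad \wedge\ \mathrm{DTR1{:}PHON} : \mathit{Clinton} \wedge \mathrm{DTR1{:}AGR} : sg \wedge \mathrm{DTR2} : \mathit{word}$
$\quad \wedge\ \mathrm{DTR2{:}CAT} : n \wedge \mathrm{DTR2{:}PHON} : \mathit{talks} \wedge \mathrm{DTR2{:}AGR} : pl)$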
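1 $\mathtt{phrase}(X) \leftarrow X = (\mathit{phrase} \wedge \mathrm{CAT} : s \wedge \mathrm{DTR1{:}CAT} : n \wedge \mathrm{DTR2{:}CAT} : v \wedge \mathrm{DTR1{:}AGR} : Y$
$\qquad \wedge\ \mathrm{DTR2{:}AGR} : Y \wedge \mathrm{DTR1} : Z_1 \wedge \mathrm{DTR2} : Z_2) \;\&\; \mathtt{sign}(Z_1) \;\&\; \mathtt{sign}(Z_2)$.

2 $\mathtt{phrase}(X) \leftarrow X = (\mathit{phrase} \wedge \mathrm{CAT} : np \wedge \mathrm{DTR1{:}CAT} : n \wedge \mathrm{DTR2{:}CAT} : n \wedge \mathrm{DTR1} : Z_1$
$\qquad \wedge\ \mathrm{DTR2} : Z_2) \;\&\; \mathtt{sign}(Z_1) \;\&\; \mathtt{sign}(Z_2)$.

3 $\mathtt{word}(X) \leftarrow X = (\mathit{word} \wedge \mathrm{CAT} : n \wedge \mathrm{PHON} : \mathit{Clinton} \wedge \mathrm{AGR} : sg)$.

4 $\mathtt{word}(X) \leftarrow X = (\mathit{word} \wedge \mathrm{CAT} : v \wedge \mathrm{PHON} : \mathit{talks} \wedge \mathrm{AGR} : sg)$.

5 $\mathtt{word}(X) \leftarrow X = (\mathit{word} \wedge \mathrm{CAT} : n \wedge \mathrm{PHON} : \mathit{talks} \wedge \mathrm{AGR} : pl)$.

6 $\mathtt{sign}(X) \leftarrow \mathtt{phrase}(X)$.

7 $\mathtt{sign}(X) \leftarrow \mathtt{word}(X)$.

Figure 2.7: Feature-based constraint logic grammar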



encoding the parses [Clinton$_{N}$ talks$_{V}$]$_{S}$ and [Clinton$_{N}$ talks$_{N}$]$_{NP}$ respectively. The parses
are depicted in Figs. 2.8 and 2.9. Note that goal reduction and constraint solving are applied
in one step. Furthermore, only success branches are depicted and the constraint solver is
viewed as a black box.



2.4 Summary 



In this chapter we discussed the basic formal concepts of the CLP scheme of Höhfeld and
Smolka (1988). These concepts provide a formal specification of the notions of constraint
language and of a constraint logic program embedding a constraint language. For convenience,
we gave some missing proofs and introduced the notions of logical consequence, derivation
tree and proof tree into the CLP scheme. These concepts will be useful in later chapters.

Furthermore, we reported the central formal details of feature-based CLGs and presented 
a simple linguistic grammar which will be used as a running example in the following chapters. 

Proof trees or parses in constraint-based NLP can be quite complex even if simple grammars
are used to analyze two-word phrases as in the example given above. Clearly, for complex
grammars and phrases of reasonable length, structural ambiguity in constraint-based NLP is
a severe problem. The task of the next two chapters is to provide a rigorous mathematical
foundation of ambiguity resolution in constraint-based NLP.

X = (sign ∧ DTR1:PHON : Clinton ∧ DTR2:PHON : talks)
    & sign(X)
  ↓ r, c
6, X = (sign ∧ DTR1:PHON : Clinton ∧ DTR2:PHON : talks)
    & phrase(X)
  ↓ r, c
1, X = (phrase ∧ CAT : s ∧ DTR1 : word ∧ DTR1:CAT : n
    ∧ DTR1:PHON : Clinton ∧ DTR1:AGR : Y ∧ DTR2 : word ∧ DTR2:CAT : v
    ∧ DTR2:PHON : talks ∧ DTR2:AGR : Y ∧ DTR1 : Z1 ∧ DTR2 : Z2)
    & sign(Z1) & sign(Z2)
  ↓ r, c
7, X = (phrase ∧ CAT : s ∧ DTR1 : word ∧ DTR1:CAT : n
    ∧ DTR1:PHON : Clinton ∧ DTR1:AGR : Y ∧ DTR2 : word ∧ DTR2:CAT : v
    ∧ DTR2:PHON : talks ∧ DTR2:AGR : Y ∧ DTR1 : Z1 ∧ DTR2 : Z2)
    & word(Z1) & sign(Z2)
  ↓ r, c
3, X = (phrase ∧ CAT : s ∧ DTR1 : word ∧ DTR1:CAT : n
    ∧ DTR1:PHON : Clinton ∧ DTR1:AGR : Y ∧ DTR1:AGR : sg ∧ DTR2 : word
    ∧ DTR2:CAT : v ∧ DTR2:PHON : talks ∧ DTR2:AGR : Y ∧ DTR2:AGR : sg
    ∧ DTR2 : Z2) & sign(Z2)
  ↓ r, c
7, X = (phrase ∧ CAT : s ∧ DTR1 : word ∧ DTR1:CAT : n
    ∧ DTR1:PHON : Clinton ∧ DTR1:AGR : Y ∧ DTR1:AGR : sg ∧ DTR2 : word
    ∧ DTR2:CAT : v ∧ DTR2:PHON : talks ∧ DTR2:AGR : Y ∧ DTR2:AGR : sg
    ∧ DTR2 : Z2) & word(Z2)
  ↓ r, c
4, X = (phrase ∧ CAT : s ∧ DTR1 : word ∧ DTR1:CAT : n
    ∧ DTR1:PHON : Clinton ∧ DTR1:AGR : Y ∧ DTR1:AGR : sg ∧ DTR2 : word
    ∧ DTR2:CAT : v ∧ DTR2:PHON : talks ∧ DTR2:AGR : Y ∧ DTR2:AGR : sg)

Figure 2.8: A derivation of [Clinton_N talks_V]_S

X = (sign ∧ DTR1:PHON : Clinton ∧ DTR2:PHON : talks)
    & sign(X)
  ↓ r, c
6, X = (sign ∧ DTR1:PHON : Clinton ∧ DTR2:PHON : talks)
    & phrase(X)
  ↓ r, c
2, X = (phrase ∧ CAT : np ∧ DTR1 : word ∧ DTR1:CAT : n
    ∧ DTR1:PHON : Clinton ∧ DTR2 : word ∧ DTR2:CAT : n
    ∧ DTR2:PHON : talks ∧ DTR1 : Z1 ∧ DTR2 : Z2)
    & sign(Z1) & sign(Z2)
  ↓ r, c
7, X = (phrase ∧ CAT : np ∧ DTR1 : word ∧ DTR1:CAT : n
    ∧ DTR1:PHON : Clinton ∧ DTR2 : word ∧ DTR2:CAT : n
    ∧ DTR2:PHON : talks ∧ DTR1 : Z1 ∧ DTR2 : Z2)
    & word(Z1) & sign(Z2)
  ↓ r, c
3, X = (phrase ∧ CAT : np ∧ DTR1 : word ∧ DTR1:CAT : n
    ∧ DTR1:PHON : Clinton ∧ DTR1:AGR : sg ∧ DTR2 : word
    ∧ DTR2:CAT : n ∧ DTR2:PHON : talks ∧ DTR2 : Z2)
    & sign(Z2)
  ↓ r, c
7, X = (phrase ∧ CAT : np ∧ DTR1 : word ∧ DTR1:CAT : n
    ∧ DTR1:PHON : Clinton ∧ DTR1:AGR : sg ∧ DTR2 : word
    ∧ DTR2:CAT : n ∧ DTR2:PHON : talks ∧ DTR2 : Z2)
    & word(Z2)
  ↓ r, c
5, X = (phrase ∧ CAT : np ∧ DTR1 : word ∧ DTR1:CAT : n
    ∧ DTR1:PHON : Clinton ∧ DTR1:AGR : sg ∧ DTR2 : word
    ∧ DTR2:CAT : n ∧ DTR2:PHON : talks ∧ DTR2:AGR : pl)

Figure 2.9: A derivation of [Clinton_N talks_N]_NP



Chapter 3 

Quantitative CLP: Quantitative 
Inference with Subjective Weights 
and its Formal Semantics 



In this chapter we present a novel framework for quantitative inference with subjective weights 
for CLP. We show soundness and completeness of the quantitative system with respect to a 
simple and intuitive formal semantics. We illustrate these concepts with a simple quantitative 
CLG and show how pruning techniques can be used to guide the search for the highest 
weighted analysis in such quantitative systems. 

This chapter is based upon work previously published in Riezler (1996).



3.1 Introduction and Overview 

Quantitative frameworks have been presented as extensions of both logic programming and
constraint-based grammars. For the area of logic programming, a system of quantitative
deduction which is sound and complete with respect to a related fixpoint semantics was
first introduced by van Emden (1986). Like this seminal approach, most of the subsequent
work on quantitative extensions of logic programming has concentrated on theoretical issues
such as questions of the expressivity of systems for quantitative logic programming, or issues
of the correctness of the connection of model theory, fixpoint theory, and proof theory for
such systems. However, none of these approaches seems to have had a specific application in
mind.

By contrast, quantitative extensions of constraint-based grammars have mainly been
motivated by practical considerations. Most approaches in this area come as numerical
extensions of the parsing strategy of existing constraint-based frameworks. However, even if for
such systems the formal foundation of the underlying framework may be clear enough, none
of these approaches comes with a well-defined semantics for its quantitative extension. That
is, such quantitative extensions have to be seen as extralogical extensions of, e.g., the
deduction scheme of the underlying CLP framework, and are not related to the model-theoretic
counterpart of this operational semantics.

This is clearly an undesirable state of affairs. Rather, in the same way as CLGs provide
a model-theoretic characterization of linguistic objects coupled with an operational parsing
system, one would like to relate a quantitative deduction system to a quantitative model
theory in a sound and complete way. The aim of this chapter is to present a sound and
complete system of quantitative CLP which satisfies the following conditions. It should

• generally be applicable to CLP over arbitrary constraint languages,

• provide a precise yet simple formal semantics for quantitative CLP deduction,

• from the outset be designed with a specific application in mind, in our case, with respect
to efficient ambiguity resolution in CLGs.

The first point means that in quantitative CLP one should not have to bother about
the peculiarities of the constraint languages embedded into the CLP scheme. Rather, the
quantitative extension should work in the same way for every constraint logic program
irrespective of the embedded constraint language. For the NLP application, this means that for
arbitrary constraint-based grammars a quantitative extension should be obtainable from the
CLG resulting from an embedding of the grammar constraint language into a CLP scheme.

The second point addresses the tradeoff between the expressive power of the quantitative
system and the intuitiveness and simplicity of its semantics. That is, since the aim of a formal
semantics is to provide a precise, unambiguous way to specify the meaning of all aspects
of an operational system at the design and implementation stage, it is justified only by
its understandability and applicability. Our approach respects these ideas of simplicity and
elegance by using the simple concepts of fuzzy set algebra as a basis for a formal semantics
for quantitative CLP.

The third point, which refers to the intended application of ambiguity resolution and
best-parse search in CLGs, is realized in quantitative CLP by stating the proof theory of
quantitative CLP in terms of min/max trees, which in turn enables strategies such as
alpha/beta pruning to be used for efficient searching for best parses in CLGs.

Clearly, generalizations of this specific choice of design for quantitative CLP should be
straightforward. However, they will not be made explicit in the following chapters.

This chapter is organized as follows. Sect. 3.2 discusses previous work on quantitative logic
programming and quantitative extensions of constraint-based grammars.

Sect. 3.3 introduces the concept of a quantitative definite clause specification, i.e., a
quantitative constraint logic program.

Sect. 3.4 introduces the declarative semantics of quantitative definite clause specifications,
i.e., a model-theoretic semantics based on concepts of fuzzy set algebra and a fixpoint
semantics obtained from minimal models in this model theory.

Sect. 3.5 presents the operational semantics of quantitative CLP. That is, based on the
concepts of quantitative derivation trees and quantitative proof trees, soundness and
completeness of quantitative deduction in CLP is proven.

A subsequent section exemplifies these concepts with a quantitative feature-based CLG, and shows
how the search technique of alpha/beta pruning can be applied to quantitative CLGs.



3.2 Previous Work 



For the area of logic programming, van Emden (1986) presented in a seminal paper a
quantitative deduction scheme and a fixpoint semantics for sets of numerically annotated Horn
clauses. The aim of this paper was to enable the expression of a continuum of uncertainties
between the usual two truth values in quantitative logic programs. The semantics of such
quantitative logic programs is based upon concepts of fuzzy set algebra, and crucially deals
with the truth-functional propagation of weights across conventional definite clauses. Van
Emden's approach initiated research into a now extensively studied area of quantitative
logic programming. For example, annotated logic programming (Subrahmanian (1987), Kifer
and Subrahmanian (1992)) extends the expressive power of quantitative rule sets by
allowing variables and evaluable function terms as annotations. Furthermore, in annotated logic
programs, annotations can be attached to atoms and their conjunctions or disjunctions, and
such programs are interpreted in powerful frameworks of lattice-theoretic semantics.
Depending on different understandings of annotations, further extensions of van Emden (1986)'s and
Subrahmanian (1987)'s approaches have been presented. Among those are approaches to
possibilistic logic programming based on subjective necessity values (Dubois, Lang, and Prade
1991), probabilistic logic programming based on intervals of subjective probabilistic truth
values (see, e.g., Ng and Subrahmanian (1992), Ng and Subrahmanian (1993)), or probabilistic
deductive databases based on subjective confidence levels coming as intervals of belief and
doubt (see, e.g., Lakshmanan and Sadri (1994), Lakshmanan and Sadri (1997)).

Quantitative extensions of constraint-based grammars have mainly been motivated by
practical considerations. For example, Douglas and Dale (1992) presented an approach to
robust parsing in PATR systems where, according to a subjective value of necessity/optionality
of constraints, constraint violations are allowed, and so robustness is introduced into the
formalism. Kim (1994) presented an approach to best-first chart parsing with PATR grammars.
In this approach, atomic values of feature structures are annotated with subjective weights,
and a weight combination scheme is defined for feature structure unification. The search space
in best-first parsing is then restricted by a threshold below which completed and predicted
feature structures are discarded. Erbach (1993a), Erbach (1993b), and Erbach (1998) introduced a
model of preference for the CUF system, which is generalizable to the CLP scheme of Höhfeld
and Smolka (1988), and which is used, among others, for tasks such as best-first parsing for
ambiguity resolution and self-monitored generation. In Erbach's model, definite clauses as a
whole are annotated with subjective preference values. Such preference values are combined
in the resolution process by calculating the preference value of a clause consequent as the
product of the preference value of the clause and the preference values of the antecedent
predicates, which are additionally weighted to add up to 1.

The aim of our approach is to combine the mathematical exactness of the
logic-programming approaches with the practical applicability of the quantitative-grammar
approaches. We will build our framework of quantitative CLP on ideas developed in the simple
and elegant framework of van Emden (1986). This means that we restrict our attention to
numerical weights attached to CLP clauses as a whole, and use the simple concepts of fuzzy
set algebra to provide the basis for an intuitive formal semantics for quantitative CLP.
Furthermore, we employ a min/max scheme for rule application which enables strategies such as
alpha/beta pruning to be used for efficient searching. Clearly, our approach improves upon
van Emden's approach by not being restricted to Horn clauses or to finite derivations.
Moreover, it enables the application of quantitative search strategies to constraint-based grammars
in a formally well-defined way.



3.3 Syntax of Quantitative CLP 



Building upon the CLP scheme of Höhfeld and Smolka (1988) reported in Chap. 2, we can
define the syntax of a quantitative definite clause specification $\mathcal{P}_F$ very quickly. The following
definitions are made with respect to implicit constraint languages $\mathcal{L}$ and $\mathcal{R}(\mathcal{L})$. A definite
clause specification $\mathcal{P}$ in $\mathcal{R}(\mathcal{L})$ can then be extended to a quantitative definite clause
specification $\mathcal{P}_F$ in $\mathcal{R}(\mathcal{L})$ simply by adding numerical factors to program clauses.

Definition 3.1 ($\mathcal{P}_F$). A quantitative definite clause specification $\mathcal{P}_F$ in $\mathcal{R}(\mathcal{L})$ is a finite set
of quantitative formulae, called quantitative definite clauses, of the form

$$\phi \;\&\; B_1 \;\&\; \ldots \;\&\; B_n \rightarrow_f A,$$

where $A, B_1, \ldots, B_n$ are $\mathcal{R}(\mathcal{L})$-atoms, $\phi$ is an $\mathcal{L}$-constraint, $n \geq 0$, $f \in (0, 1]$. We may write
a quantitative formula also as $A \leftarrow_f \phi \;\&\; B_1 \;\&\; \ldots \;\&\; B_n$.

These factors (the $f$ in Definition 3.1) should be thought of as abstract weights which
receive a concrete interpretation in specific instantiations of $\mathcal{P}_F$.

In the following the notation $\mathcal{R}(\mathcal{L})$ will be used more generally to denote relationally
extended constraint languages which possibly include quantitative formulae of the above form.
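As an informal illustration of Definition 3.1, a quantitative definite clause can be represented as a record that carries a factor in (0, 1] next to the usual head, body atoms and constraint. The following is only a sketch: the class name `QuantClause`, the string encoding of atoms and constraints, and the placeholder constraint `"true"` for clauses without an explicit constraint are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class QuantClause:
    head: str                     # consequent atom A
    body: Tuple[str, ...] = ()    # antecedent atoms B_1, ..., B_n (n >= 0)
    constraint: str = "true"      # constraint phi, kept as opaque text here
    factor: float = 1.0           # weight f, required to lie in (0, 1]

    def __post_init__(self):
        # Enforce the side condition f in (0, 1] of Definition 3.1.
        if not (0.0 < self.factor <= 1.0):
            raise ValueError(f"factor must be in (0, 1], got {self.factor}")

    def __str__(self):
        parts = [self.constraint, *self.body]
        return f"{self.head} <-[{self.factor}] " + " & ".join(parts)


# The three clauses of the example program in Fig. 3.1.
program = [
    QuantClause("q(X)", ("p(X)",), "true", 1.0),
    QuantClause("p(X)", (), "X = a", 0.7),
    QuantClause("p(X)", (), "X = b", 0.5),
]

for clause in program:
    print(clause)
```

An ordinary definite clause corresponds to the special case `factor=1.0`, which matches the remark that the classical notion of model is recovered for f = 1.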



3.4 Declarative Semantics of Quantitative CLP 

3.4.1 Fuzzy Set Algebra and Model-Theoretic Semantics 

To obtain a formal semantics for $\mathcal{P}_F$, first we have to introduce an appropriate quantitative
measure into the set-theoretic specification of $\mathcal{R}(\mathcal{L})$-interpretations. One possibility to obtain
quantitative $\mathcal{R}(\mathcal{L})$-interpretations is to base the set algebra of $\mathcal{R}(\mathcal{L})$-interpretations on the
simple and well-defined concepts of fuzzy set algebra (see Zadeh (1965)).

Relying on Höhfeld and Smolka's specification of base equivalent $\mathcal{R}(\mathcal{L})$-interpretations,
i.e., $\mathcal{R}(\mathcal{L})$-interpretations extending the same $\mathcal{L}$-interpretation, in terms of the denotations
of the relation symbols in these interpretations, we can "fuzzify" such interpretations by
regarding the denotations of their relation symbols as fuzzy subsets of the set of tuples in the
common domain.

Given constraint languages $\mathcal{L}$ and $\mathcal{R}(\mathcal{L})$, we interpret each $n$-ary relation symbol $r \in \mathcal{R}$
as a fuzzy subset of $\mathcal{D}^n$, for each $\mathcal{R}(\mathcal{L})$-interpretation $\mathcal{A}$ with domain $\mathcal{D}$. That is, we identify
the denotation of $r$ under $\mathcal{A}$ with a total function

$$\mu(\_\,; r^{\mathcal{A}}) : \mathcal{D}^n \rightarrow [0, 1],$$

which can be thought of as an abstract membership function. Such membership functions are
generalized characteristic functions, and classical set membership is coded in this context by
characteristic functions taking only 0 and 1 as values.

Next, we have to give a model-theoretic characterization of quantitative definite clauses.
Clearly, any monotone mapping could be used for the model-theoretic specification of the
interaction of weights in quantitative definite clauses and accordingly for the calculation of
weights in the proof theory of quantitative CLP. For concreteness, we will instantiate such a
mapping to the specific case of Definition 3.2, which resembles van Emden (1986)'s mode of rule
application. This will allow us to state the proof theory of quantitative CLP in terms of
min/max trees, which in turn enables strategies such as alpha/beta pruning to be used for
efficient searching. Such a quantitative CLP scheme remedies several shortcomings of van
Emden (1986)'s system: for example, our quantitative CLP scheme is not restricted to ground
instances of Horn theories, and the soundness and completeness results we will present are
not restricted to finite derivations. However, the choice of the mode of rule application made
is not crucial for the substantial claims of this chapter, and generalizations of this particular
combination mode to specific applications should be straightforward, but are beyond the
scope of this thesis.

The following definition of model corresponds to the definition of model in classical logic
when considering only clauses with $f = 1$ and mappings $\mathcal{D}^n \rightarrow \{0, 1\}$.

Definition 3.2 (Model). An $\mathcal{R}(\mathcal{L})$-interpretation $\mathcal{A}$ extending some $\mathcal{L}$-interpretation $\mathcal{I}$ is
a model of a quantitative definite clause specification $\mathcal{P}_F$ iff for each $\alpha \in \mathrm{ASS}$, for each
quantitative formula $r(\vec{x}) \leftarrow_f \phi \;\&\; q_1(\vec{x}_1) \;\&\; \ldots \;\&\; q_k(\vec{x}_k)$ in $\mathcal{P}_F$ holds:

If $\alpha \in [\![\phi]\!]^{\mathcal{I}}$, then $\mu(\alpha(\vec{x}); r^{\mathcal{A}}) \geq f \times \min\{\mu(\alpha(\vec{x}_j); q_j^{\mathcal{A}}) \mid 1 \leq j \leq k\}$.

In terms of membership degrees, this definition of model can be paraphrased as follows: If
the antecedent constraint is satisfied under a variable assignment, then the membership degree
of the denotation of the consequent atom must not be less than $f$ times the minimum of the
membership degrees of the denotations of the antecedent atoms. A truth-functional view could
be obtained by considering membership degrees as truth degrees of atoms under variable
assignments. From the viewpoint of such a truth-functional propagation of weights across
definite clauses, a clause contributes to the consequent a truth value which is $f$ times the
truth value of the antecedent.

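Note that the notation of an $\mathcal{R}(\mathcal{L})$-interpretation $\mathcal{A}$ will be used more generally to include
interpretations of quantitative formulae. $\mathcal{R}(\mathcal{L})$-solutions of a quantitative formula are defined
as $[\![r(\vec{x}) \leftarrow_f \phi \;\&\; q_1(\vec{x}_1) \;\&\; \ldots \;\&\; q_k(\vec{x}_k)]\!]^{\mathcal{A}} = \{\alpha \in \mathrm{ASS} \mid$ if $\alpha \in [\![\phi]\!]^{\mathcal{I}}$, then $\mu(\alpha(\vec{x}); r^{\mathcal{A}}) \geq
f \times \min\{\mu(\alpha(\vec{x}_j); q_j^{\mathcal{A}}) \mid 1 \leq j \leq k\}\}$.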

Based on the above definition of model, the concept of logical consequence can be defined 
as usual. 

Definition 3.3 (Logical consequence). A quantitative formula $r(\vec{x}) \leftarrow_f \phi$ is a logical
consequence of a quantitative definite clause specification $\mathcal{P}_F$ iff for each $\mathcal{R}(\mathcal{L})$-interpretation $\mathcal{A}$,
$\mathcal{A}$ is a model of $\mathcal{P}_F$ implies that $\mathcal{A}$ is a model of $\{r(\vec{x}) \leftarrow_f \phi\}$.

Furthermore, we have that the fact that $r(\vec{x}) \leftarrow_f \phi$ is a logical consequence of $\mathcal{P}_F$ implies
that $r(\vec{x}) \leftarrow_{f'} \phi$ is a logical consequence of $\mathcal{P}_F$ for every $f' \leq f$.

A goal $G$ is defined similarly to the non-quantitative case as a (possibly empty) conjunction
of $\mathcal{R}(\mathcal{L})$-atoms and $\mathcal{L}$-constraints. We can, without loss of generality, restrict goals to be of
the form $r(\vec{x}) \;\&\; \phi$, i.e., a (possibly empty) conjunction of a single relational atom $r(\vec{x})$ and an
$\mathcal{L}$-constraint $\phi$. This can be done since for each goal $G = r_1(\vec{x}_1) \;\&\; \ldots \;\&\; r_k(\vec{x}_k) \;\&\; \phi$ which
contains more than one relational atom, we can complete the program with a new clause
$C = r(\vec{x}_1, \ldots, \vec{x}_k) \leftarrow_1 r_1(\vec{x}_1) \;\&\; \ldots \;\&\; r_k(\vec{x}_k) \;\&\; \phi$, with $G$ as antecedent and a new predicate,
which takes all variables in $G$ as arguments, as consequent. Submitting the new predicate
$r(\vec{x}_1, \ldots, \vec{x}_k)$ as query yields the same results as would be obtained when querying with the
compound goal $G$.

Given some program $\mathcal{P}_F$ and some goal $G$, a quantitative $\mathcal{P}_F$-answer $\varphi$ of $G$ is defined
as a satisfiable $\mathcal{L}$-constraint $\varphi$ s.t. $\varphi \rightarrow_f G$ is a logical consequence of $\mathcal{P}_F$. A quantitative
formula $\varphi \rightarrow_f r(\vec{x}) \;\&\; \phi$ is defined to be a logical consequence of $\mathcal{P}_F$ iff every model of $\mathcal{P}_F$ is
a model of $\{\varphi \rightarrow_f r(\vec{x}) \;\&\; \phi\}$. An $\mathcal{R}(\mathcal{L})$-interpretation $\mathcal{A}$ is a model of $\{\varphi \rightarrow_f r(\vec{x}) \;\&\; \phi\}$ iff
$[\![\varphi]\!]^{\mathcal{A}} \subseteq [\![\phi]\!]^{\mathcal{A}}$ and $\mathcal{A}$ is a model of $\{r(\vec{x}) \leftarrow_f \varphi\}$.

Next we have to associate a complete lattice of interpretations with quantitative definite 
clause specifications. 

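Adopting Zadeh's definitions for set operations, we can define a partial ordering on the
set of base equivalent $\mathcal{R}(\mathcal{L})$-interpretations. This is done by defining set operations on these
interpretations with reference to set operations on the denotations of relation symbols in these
interpretations. We get for all base equivalent $\mathcal{R}(\mathcal{L})$-interpretations $\mathcal{A}, \mathcal{A}'$ and every set $X$ of such interpretations:

• $\mathcal{A} \subseteq \mathcal{A}'$ iff for each $n$-ary relation symbol $r \in \mathcal{R}$, for each $\alpha \in \mathrm{ASS}$, for each $\vec{x} \in \mathrm{VAR}^n$:
$\mu(\alpha(\vec{x}); r^{\mathcal{A}}) \leq \mu(\alpha(\vec{x}); r^{\mathcal{A}'})$,

• $\mathcal{A} = \bigcup X$ iff for each $n$-ary relation symbol $r \in \mathcal{R}$, for each $\alpha \in \mathrm{ASS}$, for each $\vec{x} \in \mathrm{VAR}^n$:
$\mu(\alpha(\vec{x}); r^{\mathcal{A}}) = \sup\{\mu(\alpha(\vec{x}); r^{\mathcal{A}'}) \mid \mathcal{A}' \in X\}$,

• $\mathcal{A} = \bigcap X$ iff for each $n$-ary relation symbol $r \in \mathcal{R}$, for each $\alpha \in \mathrm{ASS}$, for each $\vec{x} \in \mathrm{VAR}^n$:
$\mu(\alpha(\vec{x}); r^{\mathcal{A}}) = \inf\{\mu(\alpha(\vec{x}); r^{\mathcal{A}'}) \mid \mathcal{A}' \in X\}$.

Note that we define furthermore $\sup \emptyset = 0$, $\inf \emptyset = 1$. Clearly, the set of all base equivalent
$\mathcal{R}(\mathcal{L})$-interpretations is a complete lattice under the partial ordering of set inclusion. The
supremum is given by the union, and the infimum by the intersection, for any set of base
equivalent $\mathcal{R}(\mathcal{L})$-interpretations. The top element is the $\mathcal{R}(\mathcal{L})$-interpretation $\mathcal{A}^{\top}$ such that
for each $r \in \mathcal{R}$, for each $\vec{u} \in \mathcal{D}^{Ar(r)}$: $\mu(\vec{u}; r^{\mathcal{A}^{\top}}) = 1$, and the bottom element is the
$\mathcal{R}(\mathcal{L})$-interpretation $\mathcal{A}^{\bot}$ such that for each $r \in \mathcal{R}$, for each $\vec{u} \in \mathcal{D}^{Ar(r)}$: $\mu(\vec{u}; r^{\mathcal{A}^{\bot}}) = 0$.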
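For finite families of interpretations over a finite set of ground atoms, the lattice operations reduce to pointwise maximum and minimum. The sketch below makes exactly these finiteness assumptions; the dictionary encoding and the function names `leq`, `union` and `intersection` are illustrative only.

```python
def leq(a, b, keys):
    """Pointwise inclusion: a is contained in b on every ground atom."""
    return all(a.get(k, 0.0) <= b.get(k, 0.0) for k in keys)


def union(interps, keys):
    """Pointwise supremum; the supremum of an empty family is 0."""
    return {k: max((i.get(k, 0.0) for i in interps), default=0.0) for k in keys}


def intersection(interps, keys):
    """Pointwise infimum; the infimum of an empty family is 1."""
    return {k: min((i.get(k, 0.0) for i in interps), default=1.0) for k in keys}


KEYS = [("p", "a"), ("p", "b")]
A = {("p", "a"): 0.7, ("p", "b"): 0.2}
B = {("p", "a"): 0.4, ("p", "b"): 0.5}

print(union([A, B], KEYS))
print(intersection([A, B], KEYS))
print(union([], KEYS), intersection([], KEYS))
```

The empty family yields the bottom element under union and the top element under intersection, in line with the conventions for the supremum and infimum of the empty set stated above.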

3.4.2 Minimal Model Semantics 

Based upon the definition of a complete lattice of $\mathcal{R}(\mathcal{L})$-interpretations of a quantitative
definite clause specification $\mathcal{P}_F$, we can state the following equations, which link the declarative
and operational semantics of $\mathcal{P}_F$. These equations define the notion of a $\mathcal{P}_F$-chain, which will
be crucial for the construction of minimal models for $\mathcal{P}_F$. Similar to the non-quantitative case,
these equations are based on the respective definition of model, and take for the quantitative
case the following form.

Definition 3.4. Let $\mathcal{P}_F$ be a quantitative definite clause specification in $\mathcal{R}(\mathcal{L})$, $\mathcal{I}$ be an
$\mathcal{L}$-interpretation. Then the countably infinite sequence $\langle \mathcal{A}_0, \mathcal{A}_1, \mathcal{A}_2, \ldots \rangle$ of $\mathcal{R}(\mathcal{L})$-interpretations
extending $\mathcal{I}$ is a $\mathcal{P}_F$-chain iff for each $n$-ary relation symbol $r \in \mathcal{R}$, for each $\alpha \in \mathrm{ASS}$, for
each $\vec{x} \in \mathrm{VAR}^n$:

$$\mu(\alpha(\vec{x}); r^{\mathcal{A}_0}) := 0,$$

$$\mu(\alpha(\vec{x}); r^{\mathcal{A}_{i+1}}) := \max\{ f \times \min\{\mu(\alpha(\vec{x}_j); q_j^{\mathcal{A}_i}) \mid 1 \leq j \leq n\} \mid \text{there is a variant } r(\vec{x}) \leftarrow_f \phi \;\&\; q_1(\vec{x}_1) \;\&\; \ldots \;\&\; q_n(\vec{x}_n) \text{ of a clause in } \mathcal{P}_F \text{ and } \alpha \in [\![\phi]\!]^{\mathcal{A}_i} \}.$$



Before turning to the construction of minimal models, we have to prove the following
useful lemma (see van Emden (1986), Lemmata 2.10', 2.11'). Lemma 3.1 ensures that for each
tuple of objects in the denotation of a relation symbol under a minimal model, there is a
corresponding finite step in the $\mathcal{P}_F$-chain which introduces these objects into the minimal
model denotation.

Lemma 3.1. For each $\mathcal{P}_F$, for each $\mathcal{P}_F$-chain $\langle \mathcal{A}_0, \mathcal{A}_1, \mathcal{A}_2, \ldots \rangle$, for each $k$-ary relation
symbol $r \in \mathcal{R}$, for each $\alpha \in \mathrm{ASS}$, for each $\vec{x} \in \mathrm{VAR}^k$, there exists some $n \in \mathbb{N}$ s.t.
$\mu(\alpha(\vec{x}); r^{\bigcup_{i \geq 0} \mathcal{A}_i}) = \mu(\alpha(\vec{x}); r^{\mathcal{A}_n})$.

Proof. We have to show that the supremum $v = \sup\{\mu(\alpha(\vec{x}); r^{\mathcal{A}_i}) \mid i \geq 0\}$ can be attained for
some $n \in \mathbb{N}$.

$v = 0$: For $v = 0$, we have $n = 0$.

$v > 0$: For $v > 0$, we have to show that for any real $\epsilon$, $0 < \epsilon < v$, the set $\{\mu(\alpha(\vec{x}); r^{\mathcal{A}_i}) \mid i \geq 0$
and $\mu(\alpha(\vec{x}); r^{\mathcal{A}_i}) \geq \epsilon\}$ is finite.

Let $F$ be the finite set of real numbers of factors of clauses in $\mathcal{P}_F$, $m$ be the greatest
element in $F$ s.t. $m < 1$ and let $q$ be the smallest integer s.t. $m^q < \epsilon$.

Then, since each real number $\mu(\alpha(\vec{x}); r^{\mathcal{A}_i})$ is a product of a sequence of elements of $F$,
the number of different products $\geq \epsilon$ is not greater than $|F|^q$, the permutation of $|F|$
different things taken $q$ at a time with repetitions, and thus finite.

Hence, the supremum is the maximum attained for some $n \in \mathbb{N}$. □
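The counting argument of the proof can be illustrated numerically. The sketch below assumes a hypothetical factor set {1.0, 0.7, 0.5} and a hypothetical threshold of 0.2, computes the exponent q from the largest factor below 1, and enumerates all distinct products of factors that stay at or above the threshold; `products_at_least` is an illustrative name, and the threshold must be positive for the loop to terminate.

```python
from itertools import combinations_with_replacement


def products_at_least(factors, eps):
    """Return the distinct products of factors that are >= eps, and the exponent q."""
    below_one = sorted(f for f in set(factors) if f < 1.0)
    if not below_one:
        return {1.0}, 0  # only the factor 1 occurs, so every product equals 1
    m = below_one[-1]    # greatest factor strictly below 1
    q = 1
    while m ** q >= eps:  # smallest q with m**q < eps (requires eps > 0)
        q += 1
    found = set()
    # A product with q or more factors below 1 is at most m**q < eps,
    # so only products of fewer than q such factors need to be inspected.
    for length in range(q):
        for combo in combinations_with_replacement(below_one, length):
            value = 1.0
            for f in combo:
                value *= f
            if value >= eps:
                found.add(round(value, 12))
    return found, q


found, q = products_at_least([1.0, 0.7, 0.5], 0.2)
print(q, sorted(found, reverse=True))
print(len(found) <= 3 ** q)
```

The enumeration confirms for this example that only finitely many distinct values can lie above a positive threshold, so the supremum over the chain must be one of them.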



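Now we can obtain minimal model properties for quantitative definite clause specifications
similar to those for the non-quantitative programs of Höhfeld and Smolka (1988). Based on the
constructive definition of a $\mathcal{P}_F$-chain of $\mathcal{R}(\mathcal{L})$-interpretations extending an $\mathcal{L}$-interpretation
$\mathcal{I}$, an $\mathcal{R}(\mathcal{L})$-interpretation $\mathcal{A}$ is obtainable as the $\mathcal{R}(\mathcal{L})$-interpretation which is both a model
of $\mathcal{P}_F$ and minimal with respect to the lattice of base equivalent $\mathcal{R}(\mathcal{L})$-interpretations
extending $\mathcal{I}$. Theorem 3.2 states that we can construct a minimal model $\mathcal{A}$ of $\mathcal{P}_F$ for each
quantitative definite clause specification $\mathcal{P}_F$ in the extension of an arbitrary constraint
language $\mathcal{L}$ and for each $\mathcal{L}$-interpretation. This means that, due to the definiteness of $\mathcal{P}_F$, we
can restrict our attention to a minimal model semantics of $\mathcal{P}_F$.

Theorem 3.2 (Definiteness). For each $\mathcal{L}$-interpretation $\mathcal{I}$, for each quantitative definite
clause specification $\mathcal{P}_F$ in $\mathcal{R}(\mathcal{L})$, for each $\mathcal{P}_F$-chain $\langle \mathcal{A}_0, \mathcal{A}_1, \mathcal{A}_2, \ldots \rangle$ of $\mathcal{R}(\mathcal{L})$-interpretations
extending $\mathcal{I}$:

(i) $\mathcal{A}_0 \subseteq \mathcal{A}_1 \subseteq \ldots$,

(ii) the union $\mathcal{A} := \bigcup_{i \geq 0} \mathcal{A}_i$ is a model of $\mathcal{P}_F$ extending $\mathcal{I}$,

(iii) $\mathcal{A}$ is the minimal model of $\mathcal{P}_F$ extending $\mathcal{I}$.

Proof. (i) We have to show that $\mathcal{A}_i \subseteq \mathcal{A}_{i+1}$. We prove by induction on $i$ showing for each
constraint language $\mathcal{L}$, for each quantitative definite clause specification $\mathcal{P}_F$ in $\mathcal{R}(\mathcal{L})$, for each
$\mathcal{L}$-interpretation $\mathcal{I}$, for each $\mathcal{P}_F$-chain $\langle \mathcal{A}_0, \mathcal{A}_1, \mathcal{A}_2, \ldots \rangle$ of $\mathcal{R}(\mathcal{L})$-interpretations extending
$\mathcal{I}$, for each $n$-ary relation symbol $r \in \mathcal{R}$, for each $\alpha \in \mathrm{ASS}$, for each
$\vec{x} \in \mathrm{VAR}^n$, for each $i \in \mathbb{N}$: $\mu(\alpha(\vec{x}); r^{\mathcal{A}_i}) \leq \mu(\alpha(\vec{x}); r^{\mathcal{A}_{i+1}})$.

Base: $\mu(\alpha(\vec{x}); r^{\mathcal{A}_0}) = 0 \leq \mu(\alpha(\vec{x}); r^{\mathcal{A}_1})$.
Hypothesis: Suppose $\mu(\alpha(\vec{x}); r^{\mathcal{A}_{n-1}}) \leq \mu(\alpha(\vec{x}); r^{\mathcal{A}_n})$.
Step: $\mu(\alpha(\vec{x}); r^{\mathcal{A}_n}) = v > 0$

$\Longrightarrow$ there exists a variant $r(\vec{x}) \leftarrow_f \phi \;\&\; q_1(\vec{x}_1) \;\&\; \ldots \;\&\; q_k(\vec{x}_k)$ of a clause in $\mathcal{P}_F$ s.t. $v =
f \times \min\{\mu(\alpha(\vec{x}_1); q_1^{\mathcal{A}_{n-1}}), \ldots, \mu(\alpha(\vec{x}_k); q_k^{\mathcal{A}_{n-1}})\}$ and $\alpha \in [\![\phi]\!]^{\mathcal{A}_{n-1}}$, by Definition 3.4

$\Longrightarrow$ $\mu(\alpha(\vec{x}_1); q_1^{\mathcal{A}_n}) \geq \mu(\alpha(\vec{x}_1); q_1^{\mathcal{A}_{n-1}}), \ldots, \mu(\alpha(\vec{x}_k); q_k^{\mathcal{A}_n}) \geq \mu(\alpha(\vec{x}_k); q_k^{\mathcal{A}_{n-1}})$ and $\alpha \in
[\![\phi]\!]^{\mathcal{A}_n}$, by the hypothesis

$\Longrightarrow$ $\mu(\alpha(\vec{x}); r^{\mathcal{A}_{n+1}}) \geq v$, by definition of $\mu(\alpha(\vec{x}); r^{\mathcal{A}_{i+1}})$

$\Longrightarrow$ $\mu(\alpha(\vec{x}); r^{\mathcal{A}_n}) \leq \mu(\alpha(\vec{x}); r^{\mathcal{A}_{n+1}})$.

For $v = 0$ it follows immediately that $\mu(\alpha(\vec{x}); r^{\mathcal{A}_n}) \leq \mu(\alpha(\vec{x}); r^{\mathcal{A}_{n+1}})$.
Claim (i) follows by arithmetic induction.

(ii) We have to show that $\mathcal{A} := \bigcup_{i \geq 0} \mathcal{A}_i$ is a model of $\mathcal{P}_F$ extending $\mathcal{I}$. We prove that for
each clause $r(\vec{x}) \leftarrow_f \phi \;\&\; q_1(\vec{x}_1) \;\&\; \ldots \;\&\; q_k(\vec{x}_k)$ in $\mathcal{P}_F$, for each $\alpha \in \mathrm{ASS}$: If $\alpha \in [\![\phi]\!]^{\mathcal{A}}$, then
$\mu(\alpha(\vec{x}); r^{\mathcal{A}}) \geq f \times \min\{\mu(\alpha(\vec{x}_j); q_j^{\mathcal{A}}) \mid 1 \leq j \leq k\}$.

Note that since every $\mathcal{A}_i$ is an $\mathcal{R}(\mathcal{L})$-interpretation extending $\mathcal{I}$, $\mathcal{A}$ is an $\mathcal{R}(\mathcal{L})$-interpretation
extending $\mathcal{I}$.

Now let $r(\vec{x}) \leftarrow_f \phi \;\&\; q_1(\vec{x}_1) \;\&\; \ldots \;\&\; q_k(\vec{x}_k)$ be a clause in $\mathcal{P}_F$ s.t. for some $\alpha \in \mathrm{ASS}$:

$\alpha \in [\![\phi]\!]^{\mathcal{A}}$ and $\mu(\alpha(\vec{x}_i); q_i^{\mathcal{A}}) = \min\{\mu(\alpha(\vec{x}_j); q_j^{\mathcal{A}}) \mid 1 \leq j \leq k\} = v$.

Then there exists some $n \in \mathbb{N}$ s.t. $v = \mu(\alpha(\vec{x}_i); q_i^{\mathcal{A}_n}) = \min\{\mu(\alpha(\vec{x}_j); q_j^{\mathcal{A}_n}) \mid 1 \leq j \leq k\}$, by
Lemma 3.1 and since for all $j$ s.t. $1 \leq j \leq k$: $\mu(\alpha(\vec{x}_j); q_j^{\mathcal{A}}) = \sup\{\mu(\alpha(\vec{x}_j); q_j^{\mathcal{A}_i}) \mid i \geq 0\}$

$\Longrightarrow$ $\mu(\alpha(\vec{x}); r^{\mathcal{A}_{n+1}}) \geq f \times v$, by Definition 3.4

$\Longrightarrow$ $\mu(\alpha(\vec{x}); r^{\mathcal{A}}) \geq \mu(\alpha(\vec{x}); r^{\mathcal{A}_{n+1}})$, since $\mu(\alpha(\vec{x}); r^{\mathcal{A}}) = \sup\{\mu(\alpha(\vec{x}); r^{\mathcal{A}_i}) \mid i \geq 0\}$

$\Longrightarrow$ $\mu(\alpha(\vec{x}); r^{\mathcal{A}}) \geq f \times \min\{\mu(\alpha(\vec{x}_j); q_j^{\mathcal{A}}) \mid 1 \leq j \leq k\}$.
This completes the proof for claim (ii).

(iii) We have to show that $\mathcal{A}$ is the minimal model of $\mathcal{P}_F$ extending $\mathcal{I}$. We prove for every base
equivalent model $\mathcal{B}$ of $\mathcal{P}_F$: $\mathcal{A}_i \subseteq \mathcal{B}$, which gives $\mathcal{A} \subseteq \mathcal{B}$, by induction on $i$ showing for each
constraint language $\mathcal{L}$, for each quantitative definite clause specification $\mathcal{P}_F$ in $\mathcal{R}(\mathcal{L})$, for each
$\mathcal{L}$-interpretation $\mathcal{I}$, for each $\mathcal{P}_F$-chain $\langle \mathcal{A}_0, \mathcal{A}_1, \mathcal{A}_2, \ldots \rangle$ of $\mathcal{R}(\mathcal{L})$-interpretations extending
$\mathcal{I}$, for each $n$-ary relation symbol $r \in \mathcal{R}$, for each $\alpha \in \mathrm{ASS}$, for each
$\vec{x} \in \mathrm{VAR}^n$, for each $i \in \mathbb{N}$: $\mu(\alpha(\vec{x}); r^{\mathcal{A}_i}) \leq \mu(\alpha(\vec{x}); r^{\mathcal{B}})$.

Base: $\mu(\alpha(\vec{x}); r^{\mathcal{A}_0}) = 0 \leq \mu(\alpha(\vec{x}); r^{\mathcal{B}})$.
Hypothesis: Suppose $\mu(\alpha(\vec{x}); r^{\mathcal{A}_{n-1}}) \leq \mu(\alpha(\vec{x}); r^{\mathcal{B}})$.
Step: $\mu(\alpha(\vec{x}); r^{\mathcal{A}_n}) = v > 0$

$\Longrightarrow$ there exists a variant $r(\vec{x}) \leftarrow_f \phi \;\&\; q_1(\vec{x}_1) \;\&\; \ldots \;\&\; q_k(\vec{x}_k)$ of a clause in $\mathcal{P}_F$ s.t. $v =
f \times \min\{\mu(\alpha(\vec{x}_1); q_1^{\mathcal{A}_{n-1}}), \ldots, \mu(\alpha(\vec{x}_k); q_k^{\mathcal{A}_{n-1}})\}$ and $\alpha \in [\![\phi]\!]^{\mathcal{A}_{n-1}}$, by Definition 3.4

$\Longrightarrow$ $\mu(\alpha(\vec{x}_1); q_1^{\mathcal{B}}) \geq \mu(\alpha(\vec{x}_1); q_1^{\mathcal{A}_{n-1}}), \ldots, \mu(\alpha(\vec{x}_k); q_k^{\mathcal{B}}) \geq \mu(\alpha(\vec{x}_k); q_k^{\mathcal{A}_{n-1}})$ and $\alpha \in
[\![\phi]\!]^{\mathcal{B}}$, by the hypothesis

$\Longrightarrow$ $\mu(\alpha(\vec{x}); r^{\mathcal{B}}) \geq v$, since $\mathcal{B}$ is a model of $\mathcal{P}_F$

$\Longrightarrow$ $\mu(\alpha(\vec{x}); r^{\mathcal{A}_n}) \leq \mu(\alpha(\vec{x}); r^{\mathcal{B}})$.

For $v = 0$ it follows immediately that $\mu(\alpha(\vec{x}); r^{\mathcal{A}_n}) \leq \mu(\alpha(\vec{x}); r^{\mathcal{B}})$.
Claim (iii) follows by arithmetic induction. □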

The following proposition allows us to link the declarative description of the desired output
from $\mathcal{P}_F$ and a goal, i.e., a quantitative $\mathcal{P}_F$-answer, to the minimal model semantics of $\mathcal{P}_F$.
That is, Proposition 3.3 shows that quantitative $\mathcal{P}_F$-answers are completely characterized by
minimal models of $\mathcal{P}_F$. Similar to the non-quantitative case, this is done for the quantitative
case by connecting the concept of logical consequence with the concept of minimal model.

Proposition 3.3. Let $\mathcal{P}_F$ be a quantitative definite clause specification in $\mathcal{R}(\mathcal{L})$, $\varphi$ be an
$\mathcal{L}$-constraint and $G$ be a goal. Then $\varphi \rightarrow_v G$ is a logical consequence of $\mathcal{P}_F$ iff every minimal
model $\mathcal{A}$ of $\mathcal{P}_F$ is a model of $\{\varphi \rightarrow_v G\}$.

Proof. If: For each minimal model $\mathcal{A}$ of $\mathcal{P}_F$: $\mathcal{A}$ is a model of $\{\varphi \rightarrow_v G\}$

$\Longrightarrow$ for every model $\mathcal{B}$ of $\mathcal{P}_F$ base equivalent to some minimal model $\mathcal{A}$ of $\mathcal{P}_F$: $\mathcal{B}$ is a
model of $\{\varphi \rightarrow_v G\}$, since $\mathcal{A} \subseteq \mathcal{B}$ by Theorem 3.2, (iii)

$\Longrightarrow$ $\varphi \rightarrow_v G$ is a logical consequence of $\mathcal{P}_F$.

Only if: $\varphi \rightarrow_v G$ is a logical consequence of $\mathcal{P}_F$

$\Longrightarrow$ every model of $\mathcal{P}_F$ is a model of $\{\varphi \rightarrow_v G\}$, by Definition 3.3

$\Longrightarrow$ $\mathcal{A}$ is a model of $\{\varphi \rightarrow_v G\}$. □
The following example illustrates the basic concepts of the declarative semantics of
quantitative definite clause specifications. The program of Fig. 3.1 is a quantitative version of the
program of Fig. 2.1. The factors attached to clauses 2 and 3 express a preference of the
$\mathcal{L}$-constraint X = a over the $\mathcal{L}$-constraint X = b in the definition of the predicate p. Predicate
q is defined uniquely in clause 1 and gets assigned the factor 1.

1  q(X) ←_1 p(X).
2  p(X) ←_.7 X = a.
3  p(X) ←_.5 X = b.

Figure 3.1: Quantitative constraint logic program

The construction of a minimal model for the program of Fig. 3.1 is shown in Fig. 3.2. For
a variable assignment $\alpha \in [\![X = a]\!]^{\mathcal{I}}$, the membership value of .7 of the object $\langle \alpha(X) \rangle$ in the
denotation of the predicate p (resp. q) under the minimal model $\mathcal{A}$ is obtained in step 1 (resp.
step 2) of the $\mathcal{P}_F$-chain construction. For a variable assignment $\alpha \in [\![X = b]\!]^{\mathcal{I}}$, a membership
degree of .5 is obtained in similar manner.

Clearly, $\mathcal{A} = \bigcup_{i \geq 0} \mathcal{A}_i$ is a minimal model of the quantitative program of Fig. 3.1.
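The chain construction for the program of Fig. 3.1 can be replayed by a small fixpoint iteration. The sketch below assumes a two-element domain {a, b}, unary predicates, and constraints encoded as Python predicates; `step` applies one stage of the chain definition, and `chain` iterates until nothing changes. All names are illustrative.

```python
# Clauses as (head predicate, factor, constraint, body predicates).
CLAUSES = [
    ("q", 1.0, lambda x: True, ["p"]),
    ("p", 0.7, lambda x: x == "a", []),
    ("p", 0.5, lambda x: x == "b", []),
]
DOMAIN = ["a", "b"]
PREDICATES = ["p", "q"]


def step(mu):
    """Compute the next interpretation of the chain from the current one."""
    new = {}
    for pred in PREDICATES:
        for obj in DOMAIN:
            candidates = [
                # factor times the minimum over the body (1 for an empty body)
                factor * min((mu[(b, obj)] for b in body), default=1.0)
                for head, factor, constraint, body in CLAUSES
                if head == pred and constraint(obj)
            ]
            # The maximum over no applicable clause is taken to be 0.
            new[(pred, obj)] = max(candidates, default=0.0)
    return new


def chain(max_steps=10):
    """Iterate from the bottom interpretation until a fixpoint is reached."""
    mu = {(p, o): 0.0 for p in PREDICATES for o in DOMAIN}
    history = [mu]
    for _ in range(max_steps):
        nxt = step(mu)
        history.append(nxt)
        if nxt == mu:
            break
        mu = nxt
    return history


for i, interp in enumerate(chain()):
    print(i, interp)
```

In this run p receives its degrees in step 1 and q in step 2, after which the interpretation no longer changes, matching the values listed in Fig. 3.2.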



3.5 Operational Semantics for Quantitative CLP 
3.5.1 Min/Max Trees and Quantitative Proof Trees 

The proof procedure for quantitative CLP can be stated conveniently as a search of a tree, 
corresponding to the search of an SLD-and/or tree in conventional logic programming or to 
the search of a derivation tree as defined in Chap. § for CLP. The structure of such a tree 
exactly mirrors the construction of a minimal model and thus may be defined as a min/max 
tree. That is, according to the minimal model construction, which is based on the operations 
min and max, a min/max tree combines the standard left-right selection and depth-first search
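$\alpha \in [\![X = a]\!]^{\mathcal{I}}$:

$\mu(\langle\alpha(X)\rangle; p^{\mathcal{A}_0}) = 0$,
$\mu(\langle\alpha(X)\rangle; p^{\mathcal{A}_1}) = \max\{.7 \times \min\emptyset\} = .7$,
$\mu(\langle\alpha(X)\rangle; p^{\mathcal{A}_2}) = \max\{.7 \times \min\emptyset\} = .7$,
$\mu(\langle\alpha(X)\rangle; p^{\bigcup_{i \geq 0} \mathcal{A}_i}) = \sup\{0, .7, .7, \ldots\} = .7$,

$\mu(\langle\alpha(X)\rangle; q^{\mathcal{A}_0}) = 0$,
$\mu(\langle\alpha(X)\rangle; q^{\mathcal{A}_1}) = 0$,
$\mu(\langle\alpha(X)\rangle; q^{\mathcal{A}_2}) = \max\{1 \times \min\{.7\}\} = .7$,
$\mu(\langle\alpha(X)\rangle; q^{\bigcup_{i \geq 0} \mathcal{A}_i}) = \sup\{0, 0, .7, \ldots\} = .7$.

$\alpha \in [\![X = b]\!]^{\mathcal{I}}$:

$\mu(\langle\alpha(X)\rangle; p^{\mathcal{A}_0}) = 0$,
$\mu(\langle\alpha(X)\rangle; p^{\mathcal{A}_1}) = \max\{.5 \times \min\emptyset\} = .5$,
$\mu(\langle\alpha(X)\rangle; p^{\mathcal{A}_2}) = \max\{.5 \times \min\emptyset\} = .5$,
$\mu(\langle\alpha(X)\rangle; p^{\bigcup_{i \geq 0} \mathcal{A}_i}) = \sup\{0, .5, .5, \ldots\} = .5$,

$\mu(\langle\alpha(X)\rangle; q^{\mathcal{A}_0}) = 0$,
$\mu(\langle\alpha(X)\rangle; q^{\mathcal{A}_1}) = 0$,
$\mu(\langle\alpha(X)\rangle; q^{\mathcal{A}_2}) = \max\{1 \times \min\{.5\}\} = .5$,
$\mu(\langle\alpha(X)\rangle; q^{\bigcup_{i \geq 0} \mathcal{A}_i}) = \sup\{0, 0, .5, \ldots\} = .5$.

Figure 3.2: $\mathcal{P}_F$-chain for quantitative constraint logic program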






with a min/max calculation of node values. A relation node of a derivation tree corresponds
in the quantitative case to a max-node, and a constraint node to a min-node. In contrast to 
derivation trees, in min/max trees the unique successor of a constraint node is split up into 
several successor nodes, one for each relational atom in the goal. This is necessary to calculate 
a minimum of node values at a min-node. 

In the following we will assume implicit constraint languages $\mathcal{L}$ and $\mathcal{R}(\mathcal{L})$ and a given quantitative definite clause specification $\mathcal{P}_F$ in $\mathcal{R}(\mathcal{L})$. Furthermore, $V$ will denote the finite set of variables in the query, and the $V$-solutions of a constraint $\phi$ in an interpretation $\mathcal{I}$ are defined as $[\![\phi]\!]^{\mathcal{I}}_V := \{\alpha|_V \mid \alpha \in [\![\phi]\!]^{\mathcal{I}}\}$, where $\alpha|_V$ is the restriction of $\alpha$ to $V$.

Definition 3.5 (Min/max tree). A min/max tree determined by a query $G_1$ and a quantitative definite clause specification $\mathcal{P}_F$ has to satisfy the following conditions:

• Each max-node is labeled by a goal. The value of each nonterminal max-node is the maximum of the values of its successors.

• Each min-node is labeled by a clause from $\mathcal{P}_F$ and a goal. The value of each nonterminal min-node is $f \times m$, where $f$ is the factor of the clause and $m$ is the minimum of the values of its successors.

• The successors of every max-node are all min-nodes s.t. for every clause $C$ with $\xrightarrow{r}$-resolvent $G'$ obtained by $C$ from goal $G$ in a max-node, there is a min-node successor labeled by $C$ and $G'$.

• The successors of every min-node are all max-nodes s.t. for every $\mathcal{R}(\mathcal{L})$-atom $r(x)$ in goal $G \,\&\, \phi \,\&\, \phi'$ in a min-node with $\xrightarrow{c}$-resolvent $G \,\&\, \phi''$, there is a max-node successor labeled by $r(x) \,\&\, \phi''$.

• The root node is a max-node labeled by $G_1$.

• A success node is a terminal max-node labeled by a satisfiable $\mathcal{L}$-constraint. The value of a success node is 1.

• A failure node is a terminal max-node which is not a success node. The value of a failure node is 0.

Similar to the non-quantitative case, a proof tree in the quantitative case is a subtree of a 
derivation tree. However, in a quantitative proof tree, each min-node takes all of the successors 
of the min-node of the min/max tree as its successors. Furthermore, to check the consistency 
of the constraint solving results in the min-node successors, an additional $\xrightarrow{c}$-step has to be applied to the conjunction of all success nodes of a quantitative proof tree. This step yields a satisfiable $\mathcal{L}$-constraint, called answer constraint, if the conjunction of the $\mathcal{L}$-constraints in the success nodes is satisfiable.

Definition 3.6 (Quantitative proof tree). A quantitative proof tree for a goal $G_1$ from a quantitative definite clause specification $\mathcal{P}_F$ is a subtree of a min/max supertree determined by $G_1$ and $\mathcal{P}_F$ and defined as follows:

• The root node of the proof tree is the root node of the supertree.

• A max-node of the proof tree is a max-node of the supertree and takes one of the successors of the supertree max-node as its successor.

• A min-node of the proof tree is a min-node of the supertree and takes all of the successors of the supertree min-node as its successors.

• All terminal nodes in the proof tree are success nodes $\phi, \phi', \ldots$ s.t. $\phi \,\&\, \phi' \,\&\, \ldots \xrightarrow{c} \psi$ and $\psi$ is a satisfiable $\mathcal{L}$-constraint, called answer constraint.

• Values are assigned to proof tree nodes in the same way as to min/max tree nodes.
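
To make the two definitions concrete, the following minimal sketch enumerates the quantitative proof trees of a small program and computes their root values. It is an illustration under stated assumptions rather than the thesis's procedure: the program of Fig. 3.1 is encoded in a ground form, the final consistency check on the conjunction of success nodes is omitted, and the names `PROGRAM`, `ANSWER`, `proof_trees` and `root_value` are hypothetical.

```python
from itertools import product

# Ground rendering of Fig. 3.1: predicate -> list of
# (clause number, factor, body atoms). An empty body stands for a clause whose
# body is a satisfiable constraint, i.e. it leads to a success node of value 1.
PROGRAM = {
    "q": [(1, 1.0, ["p"])],
    "p": [(2, 0.7, []), (3, 0.5, [])],
}
# Constraint contributed by each constraint-only clause (for display only).
ANSWER = {2: "X = a", 3: "X = b"}


def proof_trees(pred):
    """Yield (root value, clauses used) for every proof tree of a max-node."""
    for number, factor, body in PROGRAM[pred]:      # max-node: pick ONE clause
        sub = [list(proof_trees(atom)) for atom in body]
        for combo in product(*sub):                 # min-node: keep ALL atoms
            value = factor * min([v for v, _ in combo], default=1.0)
            clauses = [number] + [c for _, used in combo for c in used]
            yield value, clauses


def root_value(pred):
    """Value of the max-node in the min/max tree: best proof tree value."""
    return max((v for v, _ in proof_trees(pred)), default=0.0)


trees = sorted(proof_trees("q"), reverse=True)
for value, clauses in trees:
    answers = [ANSWER[c] for c in clauses if c in ANSWER]
    print(value, clauses, answers)
```

Each proof tree fixes one clause per max-node and keeps every body atom per min-node, so the program above has exactly two proof trees, one through clause 2 and one through clause 3.
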
3.5.2 Soundness and Completeness 

To prove soundness and completeness of the generalized SLD-resolution proof procedure de- 
fined via min/max trees and quantitative proof trees, some further concepts have to be intro- 
duced. 

Note that the definitions of renaming, $\rho$-variant, and variant carry over to the quantitative case without changes. Clearly, we have the property that a constraint language $\mathcal{R}(\mathcal{L})$ containing quantitative definite clauses is closed under renaming if the underlying constraint language $\mathcal{L}$ is closed under renaming. Furthermore, for each such generalized constraint language $\mathcal{R}(\mathcal{L})$ which is closed under renaming, and for each $\mathcal{R}(\mathcal{L})$-interpretation $\mathcal{A}$, we have that $\mathcal{A}$ is a model of an $\mathcal{R}(\mathcal{L})$-constraint iff $\mathcal{A}$ is a model of each of its variants.

Next, we have to redefine a complexity measure for goal reduction for the quantitative 
case. This measure is crucial in proving termination of goal reduction and works by keying 
steps of the minimal model construction to steps of the goal reduction process. 

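• The complexity of a variable assignment $\alpha$ for an atom $r(x)$ in the minimal model $\mathcal{A}$ s.t. $\mu(\alpha(x); r^{\mathcal{A}}) > 0$ is defined as

$\mathrm{comp}(\alpha, r(x), \mathcal{A}) := \min\{i \mid \mu(\alpha(x); r^{\mathcal{A}}) = \mu(\alpha(x); r^{\mathcal{A}_i})\}$;

• The complexity of $\alpha$ for goal $G = r_1(x_1) \,\&\, \ldots \,\&\, r_k(x_k) \,\&\, \phi$ in $\mathcal{A}$ s.t. $\alpha \in [\![\phi]\!]^{\mathcal{A}}$ and $\mu(\alpha(x_i); r_i^{\mathcal{A}}) > 0$ for all $i : 1 \leq i \leq k$ is defined as

$\mathrm{comp}(\alpha, G, \mathcal{A}) := \{\mathrm{comp}(\alpha, r_i(x_i), \mathcal{A}) \mid 1 \leq i \leq k\}$

where $\{\ldots\}$ is a multiset.

• The $V$-complexity of $\alpha$ for goal $G = r_1(x_1) \,\&\, \ldots \,\&\, r_k(x_k) \,\&\, \phi$ in $\mathcal{A}$ s.t. $\alpha \in [\![\phi]\!]^{\mathcal{A}}_V$ and $\mu(\alpha(x_i); r_i^{\mathcal{A}}) > 0$ for all $i : 1 \leq i \leq k$ is defined as

$\mathrm{comp}_V(\alpha, G, \mathcal{A}) := \min\{\mathrm{comp}(\beta, G, \mathcal{A}) \mid \beta \in [\![\phi]\!]^{\mathcal{A}},\ \mu(\beta(x_i); r_i^{\mathcal{A}}) > 0 \text{ for all } i : 1 \leq i \leq k \text{ and } \alpha = \beta|_V\}$.

The minimum is taken with respect to a total ordering on multisets s.t. $M < M'$ iff $\forall x \in M \setminus M',\ \exists x' \in M' \setminus M$ s.t. $x < x'$.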
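
The multiset ordering used for the complexity measure can be tried out on small examples. The sketch below is a direct transcription of the condition above using multiset difference; the function name `multiset_less` is hypothetical, and the extra requirement that the two multisets differ is an assumption added here so that the relation is strict.

```python
from collections import Counter


def multiset_less(m, m_prime):
    """M < M' iff every x in M \\ M' is below some x' in M' \\ M."""
    a, b = Counter(m), Counter(m_prime)
    if a == b:
        return False  # assumption of this sketch: the ordering is strict
    only_m = a - b            # multiset difference M \ M'
    only_m_prime = b - a      # multiset difference M' \ M
    return all(any(x < y for y in only_m_prime) for x in only_m)


# Replacing one step index by several smaller ones yields a smaller multiset,
# which is the situation exploited in the induction on complexity.
print(multiset_less([2, 2, 1], [3]))
print(multiset_less([3], [2, 2, 1]))
```

Under this ordering a goal whose atoms all have complexity below $i$ is smaller than a goal consisting of a single atom of complexity $i$, regardless of how many atoms it has.
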



The following proofs show that the quantitative proof procedure is sound and complete 
with respect to the above stated semantic concepts. Again, there is a close similarity to the 



corresponding statements for the non-quantitative case of Hohfeld and Smolka (198- 



Theorem 3.4 (Soundness). For each quantitative definite clause specification $\mathcal{P}_F$, for each goal $G$, for each $\mathcal{L}$-constraint $\varphi$: If there is a quantitative proof tree for $G$ from $\mathcal{P}_F$ with answer constraint $\varphi$ and root value $v$, then $\varphi \rightarrow_v G$ is a logical consequence of $\mathcal{P}_F$.

Proof. The result is proven by induction on the depth $d$ of the quantitative proof tree, where one unit of depth is from max-node to max-node.

Base: We know that quantitative proof trees of depth $d = 0$ have to take the form of a single max-node labeled by a satisfiable $\mathcal{L}$-constraint $\psi$ with root value 1. Then $\psi \rightarrow_1 \psi$ is a logical consequence of $\mathcal{P}_F$.

Hypothesis: Suppose the result holds for quantitative proof trees of depth $d < n$.

Step: Let $G_0 = r(x) \,\&\, \phi$ be a goal labeling a quantitative proof tree of depth $d = n$ with answer constraint $\psi$ and root value $h$,

let $G'_0 = q_1(x_1) \,\&\, \ldots \,\&\, q_k(x_k) \,\&\, \phi \,\&\, \phi'$ be a goal labeling the min-node obtained from $G_0$ via $\xrightarrow{r}$ using the variant $C' = r(x) \leftarrow_f \phi' \,\&\, q_1(x_1) \,\&\, \ldots \,\&\, q_k(x_k)$ of a clause $C$ in $\mathcal{P}_F$,

and let $G_1 = q_1(x_1) \,\&\, \phi'', \ldots, G_k = q_k(x_k) \,\&\, \phi''$ be goals labeling max-nodes obtained from $G'_0$ via $\xrightarrow{c}$.

Then each goal $G_1, \ldots, G_k$ labels a quantitative proof tree of depth $d < n$ with respective answer constraint $\psi_1, \ldots, \psi_k$ and root value $g_1, \ldots, g_k$ s.t. $h = f \times \min\{g_1, \ldots, g_k\}$ and for each model $\mathcal{A}$ of $\mathcal{P}_F$: $[\![\psi]\!]^{\mathcal{A}} = [\![\psi_1 \,\&\, \ldots \,\&\, \psi_k]\!]^{\mathcal{A}}$, by definition of min/max tree

$\Longrightarrow$ $\psi_1 \rightarrow_{g_1} G_1, \ldots, \psi_k \rightarrow_{g_k} G_k$ are logical consequences of $\mathcal{P}_F$, by the hypothesis

$\Longrightarrow$ for each model $\mathcal{A}$ of $\mathcal{P}_F$, for each $\alpha \in ASS$: $[\![\psi]\!]^{\mathcal{A}} \subseteq [\![\phi'']\!]^{\mathcal{A}}$ and if $\alpha \in [\![\psi]\!]^{\mathcal{A}}$, then $\mu(\alpha(x_1); q_1^{\mathcal{A}}) \geq g_1, \ldots, \mu(\alpha(x_k); q_k^{\mathcal{A}}) \geq g_k$, by definition of logical consequence

$\Longrightarrow$ for each model $\mathcal{A}$ of $\mathcal{P}_F$, for each $\alpha \in ASS$: $[\![\psi]\!]^{\mathcal{A}} \subseteq [\![\phi']\!]^{\mathcal{A}}$ and if $\alpha \in [\![\psi]\!]^{\mathcal{A}}$, then $\mu(\alpha(x); r^{\mathcal{A}}) \geq f \times \min\{\mu(\alpha(x_1); q_1^{\mathcal{A}}), \ldots, \mu(\alpha(x_k); q_k^{\mathcal{A}})\}$, since each model $\mathcal{A}$ of $\mathcal{P}_F$ is a model of $C'$ iff $\mathcal{A}$ is a model of $C$

$\Longrightarrow$ for each model $\mathcal{A}$ of $\mathcal{P}_F$, for each $\alpha \in ASS$: $[\![\psi]\!]^{\mathcal{A}} \subseteq [\![\phi]\!]^{\mathcal{A}}$ and if $\alpha \in [\![\psi]\!]^{\mathcal{A}}$, then $\mu(\alpha(x); r^{\mathcal{A}}) \geq h$

$\Longrightarrow$ $\psi \rightarrow_h r(x) \,\&\, \phi$ is a logical consequence of $\mathcal{P}_F$.

The result follows by arithmetic induction. □



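Theorem 3.5 (Completeness). Let $\mathcal{P}_F$ be a quantitative definite clause specification in $\mathcal{R}(\mathcal{L})$, $\mathcal{L}$ be closed under renaming, $\mathcal{A}$ be a minimal model of $\mathcal{P}_F$, $G$ be a goal of the form $r(x) \,\&\, \phi$, $\alpha \in [\![\phi]\!]^{\mathcal{A}}_V$ and $\mu(\beta(x); r^{\mathcal{A}}) = v$ s.t. $v > 0$ and $\alpha = \beta|_V$. Then there exists a quantitative proof tree for $G$ from $\mathcal{P}_F$ with answer constraint $\varphi$ and root value $v$ and $\alpha \in [\![\varphi]\!]^{\mathcal{A}}_V$.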

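Proof. The result is proven by induction on $c = \mathrm{comp}_V(\alpha, G, \mathcal{A})$.

Base: We know that goals with complexity $c = \emptyset$ have to take the form of a satisfiable $\mathcal{L}$-constraint $\chi$. Then there exists a quantitative proof tree for $\chi$ from $\mathcal{P}_F$ consisting of a single max-node labeled with $\chi$ and root value 1.

Hypothesis: Suppose the result holds for goals with complexity $c < N$.

Step: Let $G_0 = q(x) \,\&\, \psi$, $\alpha' \in [\![\psi]\!]^{\mathcal{A}}_V$, $\alpha'' \in [\![\psi]\!]^{\mathcal{A}}$, $\alpha' = \alpha''|_V$, $\mathrm{comp}_V(\alpha', G_0, \mathcal{A}) = \mathrm{comp}(\alpha'', G_0, \mathcal{A}) = N$, $\mathrm{comp}(\alpha'', q(x), \mathcal{A}) := i$, $\mu(\alpha''(x); q^{\mathcal{A}}) = h$ and $h > 0$.

First we observe that $\mu(\alpha''(x); q^{\mathcal{A}_i}) = h$, since $\mathrm{comp}(\alpha'', q(x), \mathcal{A}) := i$

$\Longrightarrow$ there exists a variant $q(x) \leftarrow_f \psi' \,\&\, q_1(x_1) \,\&\, \ldots \,\&\, q_k(x_k)$ s.t. $h = f \times \min\{\mu(\alpha''(x_1); q_1^{\mathcal{A}_{i-1}}), \ldots, \mu(\alpha''(x_k); q_k^{\mathcal{A}_{i-1}})\}$ and $\alpha'' \in [\![\psi']\!]^{\mathcal{A}_{i-1}}$ and $(V \cup \mathcal{V}(\psi)) \cap \mathcal{V}(\psi' \,\&\, q_1(x_1) \,\&\, \ldots \,\&\, q_k(x_k)) \subseteq \mathcal{V}(q(x))$, by Definition 3.2 and renaming closure of $\mathcal{R}(\mathcal{L})$, finite $V$ and infinitely many variables in VAR

$\Longrightarrow$ $G_0 \xrightarrow{r} G'_0$ s.t. $G'_0 = q_1(x_1) \,\&\, \ldots \,\&\, q_k(x_k) \,\&\, \psi''$ and $\psi'' = \psi \,\&\, \psi'$, by definition of the inference rules.

Next, $\alpha' \in [\![\psi'']\!]^{\mathcal{A}}_V$, since $\alpha'' \in [\![\psi]\!]^{\mathcal{A}}$, $\alpha'' \in [\![\psi']\!]^{\mathcal{A}_{i-1}} \subseteq [\![\psi']\!]^{\mathcal{A}}$, $\alpha'' \in [\![\psi \,\&\, \psi']\!]^{\mathcal{A}}$, $[\![\psi \,\&\, \psi']\!]^{\mathcal{A}}_V = [\![\psi'']\!]^{\mathcal{A}}_V$ and $\alpha' = \alpha''|_V$.

Finally, $\mathrm{comp}_V(\alpha', G'_0, \mathcal{A}) < N$, since $\mathrm{comp}_V(\alpha', G'_0, \mathcal{A}) \leq \mathrm{comp}(\alpha'', G'_0, \mathcal{A}) < \{i\} = \{\mathrm{comp}(\alpha'', q(x), \mathcal{A})\} = \mathrm{comp}(\alpha'', G_0, \mathcal{A}) = \mathrm{comp}_V(\alpha', G_0, \mathcal{A}) = N$.

Now we can obtain goals $G_1 = q_1(x_1) \,\&\, \psi'', \ldots, G_k = q_k(x_k) \,\&\, \psi''$ from $G'_0$ s.t. $\alpha' \in [\![\psi'']\!]^{\mathcal{A}}_V$, $\mu(\alpha''(x_1); q_1^{\mathcal{A}}) = g_1 > 0, \ldots, \mu(\alpha''(x_k); q_k^{\mathcal{A}}) = g_k > 0$, $\alpha' = \alpha''|_V$ and $\mathrm{comp}_V(\alpha', G_1, \mathcal{A}) < N, \ldots, \mathrm{comp}_V(\alpha', G_k, \mathcal{A}) < N$

$\Longrightarrow$ for each goal $G_1, \ldots, G_k$, there exists a quantitative proof tree from $\mathcal{P}_F$ with respective answer constraint $\chi_1, \ldots, \chi_k$ and respective root value $g'_1 = g_1, \ldots, g'_k = g_k$ and $\alpha' \in [\![\chi_1 \,\&\, \ldots \,\&\, \chi_k]\!]^{\mathcal{A}}_V = [\![\chi]\!]^{\mathcal{A}}_V$, by the hypothesis

$\Longrightarrow$ there exists a quantitative proof tree for $G_0$ from $\mathcal{P}_F$ with answer constraint $\chi$ and root value $h' = f \times \min\{g'_1, \ldots, g'_k\} = f \times \min\{g_1, \ldots, g_k\} = h$ and $\alpha' \in [\![\chi]\!]^{\mathcal{A}}_V$.

The result follows by arithmetic induction. □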



Returning to our toy example, the proof procedure for quantitative definite clause specifications can be illustrated as follows. A min/max derivation tree for the query $q(X) \,\&\, X = e$ and the program of Fig. 3.1 is given in Fig. 3.3.

Figure 3.3: Min/max tree for quantitative constraint logic program

This tree contains two success nodes, $X = a$ and $X = b$, from which two distinct quantitative proof trees can be obtained (see Fig. 3.4).

Soundness of quantitative CLP tells us that corresponding to the quantitative proof tree with answer constraint $X = a$ (resp. $X = b$) and root value .7 (resp. .5), we know that the quantitative formula $X = a \rightarrow_{.7} q(X) \,\&\, X = e$ (resp. $X = b \rightarrow_{.5} q(X) \,\&\, X = e$) is a logical consequence of the program of Fig. 3.1. This can easily be verified from the minimal model constructed in Fig. 3.2.

Completeness says that for an object $\langle\alpha(X)\rangle$ assigned by $\alpha \in [\![X = e]\!]^{\mathcal{I}}$ with membership degree $\mu(\langle\alpha(X)\rangle; q^{\mathcal{A}}) = .7$ to the denotation of q under the minimal model $\mathcal{A}$, we have a corresponding proof tree with answer constraint $X = a$ and root value .7 and $\alpha \in [\![X = a]\!]^{\mathcal{I}}$.
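
First proof tree (nodes from root to success node, edge labels r and c between them):

$q(X) \,\&\, X = e$    $\max\{.7\} = .7$
  r
$1,\ p(X) \,\&\, X = e$    $1 \times \min\{.7\} = .7$
  c
$p(X) \,\&\, X = e$    $\max\{.7\} = .7$
  r
$2,\ X = e \,\&\, X = a$    $.7 \times \min\{1\}$
  c
$X = a$    $1$

Second proof tree:

$q(X) \,\&\, X = e$    $\max\{.5\} = .5$
  r
$1,\ p(X) \,\&\, X = e$    $1 \times \min\{.5\} = .5$
  c
$p(X) \,\&\, X = e$    $\max\{.5\} = .5$
  r
$3,\ X = e \,\&\, X = b$    $.5 \times \min\{1\}$
  c
$X = b$    $1$

Figure 3.4: Quantitative proof trees for quantitative constraint logic program

Similarly, for an object $\langle\alpha'(X)\rangle$ with $\alpha' \in [\![X = e]\!]^{\mathcal{I}}$ and $\mu(\langle\alpha'(X)\rangle; q^{\mathcal{A}}) = .5$, we have a proof tree with answer constraint $X = b$ and root value .5 and $\alpha' \in [\![X = b]\!]^{\mathcal{I}}$.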



3.6 Parsing and Searching in Quantitative CLGs 



The quantitative CLP scheme presented in the last chapter allows for a definition of the parsing problem (and similarly of the generation problem) for quantitative CLGs in the following way: Given a program $\mathcal{P}_F$ (encoding some quantitative CLG) and a query $G$ (encoding some input sentence), we ask if we can infer a $\mathcal{P}_F$-answer $\varphi$ of $G$ (encoding a parse of the input sentence) at a value $v$ (encoding the weight of the parse) proving $\varphi \rightarrow_v G$ to be a logical consequence of $\mathcal{P}_F$. That is, according to the soundness and completeness results presented above, the operational concept of a quantitative proof tree has a declarative counterpart in the form of a quantitative $\mathcal{P}_F$-answer. Truth-functionally, a quantitative $\mathcal{P}_F$-answer tells us that the answer constraint $\varphi$ contributes a truth-value $v$ to the goal $G$ in every model of $\mathcal{P}_F$. In terms of membership values, this means that a $\mathcal{P}_F$-answer to a query $G = r(x) \,\&\, \phi$ at value $v$ is a satisfiable $\mathcal{L}$-constraint $\varphi$ such that for each model $\mathcal{A}$ of $\mathcal{P}_F$ the following holds: If $\varphi$ is satisfiable, then $\phi$ is satisfiable and all objects assigned to $x$ by a solution of $\varphi$ are in the denotation of $r(x)$ at a membership value of at least $v$.

3.6.1 Quantitative Feature-Based CLGs

Returning to the simple linguistic CLG of Fig. 2.7, the formal scheme described above can be illustrated as follows.

The quantitative constraint logic program $\mathcal{P}_F$ of Fig. 3.5 is obtained from the program of Fig. 2.7 simply by adding numerical factors to the program clauses.

1  $phrase(X) \leftarrow_{f_1} X = (phrase \wedge CAT : s \wedge DTR1:CAT : n \wedge DTR2:CAT : v \wedge DTR1:AGR : Y \wedge DTR2:AGR : Y \wedge DTR1 : Z_1 \wedge DTR2 : Z_2) \,\&\, sign(Z_1) \,\&\, sign(Z_2).$

2  $phrase(X) \leftarrow_{f_2} X = (phrase \wedge CAT : np \wedge DTR1:CAT : n \wedge DTR2:CAT : n \wedge DTR1 : Z_1 \wedge DTR2 : Z_2) \,\&\, sign(Z_1) \,\&\, sign(Z_2).$

3  $word(X) \leftarrow_{f_3} X = (word \wedge CAT : n \wedge PHON : Clinton \wedge AGR : sg).$

4  $word(X) \leftarrow_{f_4} X = (word \wedge CAT : v \wedge PHON : talks \wedge AGR : sg).$

5  $word(X) \leftarrow_{f_5} X = (word \wedge CAT : n \wedge PHON : talks \wedge AGR : pl).$

6  $sign(X) \leftarrow_{f_6} phrase(X).$

7  $sign(X) \leftarrow_{f_7} word(X).$

Figure 3.5: Quantitative feature-based constraint logic grammar 



Given the quantitative CLG of Fig. 3.5 and a goal $G$ of the form

$X = (sign \wedge DTR1:PHON : Clinton \wedge DTR2:PHON : talks) \,\&\, sign(X)$,

again encoding the input sentence Clinton talks, we can infer two different proof trees for $G$, each with a specific answer constraint, encoding a parse, and a specific root value, encoding the preference value of the parse. Again, we will depict only success branches and consider the constraint solver as a black box. The two derivations are shown in Figs. 3.6 and 3.7.

The answer constraint $\phi$ of the first derivation is obtained by constraint solving of the terminal constraints of the first proof tree. We get

$\ast[\ldots] \,\&\, \ast[\ldots] \xrightarrow{c} X = (phrase \wedge CAT : s \wedge DTR1 : word \wedge DTR1:CAT : n \wedge DTR1:PHON : Clinton \wedge DTR1:AGR : Y \wedge DTR1:AGR : sg \wedge DTR2 : word \wedge DTR2:CAT : v \wedge DTR2:PHON : talks \wedge DTR2:AGR : Y \wedge DTR2:AGR : sg)$

yielding the reading $[Clinton_N\ talks_V]_S$ with weight

$\nu = f_6 \times f_1 \times \min\{f_7 \times f_3, f_7 \times f_4\}$.

The answer constraint $\psi$ of the second derivation is

Common part of the derivation (nodes from the root downwards, each followed by its value; r and c label the edges):

$X = (sign \wedge DTR1:PHON : Clinton \wedge DTR2:PHON : talks) \,\&\, sign(X)$    $f_6 \times f_1 \times \min\{f_7 \times f_3, f_7 \times f_4\}$
  r
$6,\ X = (sign \wedge DTR1:PHON : Clinton \wedge DTR2:PHON : talks) \,\&\, phrase(X)$    $f_6 \times f_1 \times \min\{f_7 \times f_3, f_7 \times f_4\}$
  c
$X = (sign \wedge DTR1:PHON : Clinton \wedge DTR2:PHON : talks) \,\&\, phrase(X)$    $f_1 \times \min\{f_7 \times f_3, f_7 \times f_4\}$
  r
$1,\ X = (sign \wedge DTR1:PHON : Clinton \wedge DTR2:PHON : talks) \,\&\, X = (phrase \wedge CAT : s \wedge DTR1:CAT : n \wedge DTR2:CAT : v \wedge DTR1:AGR : Y \wedge DTR2:AGR : Y \wedge DTR1 : Z_1 \wedge DTR2 : Z_2) \,\&\, sign(Z_1) \,\&\, sign(Z_2)$    $f_1 \times \min\{f_7 \times f_3, f_7 \times f_4\}$
  c (two successors)

In the following, $\ast[\ldots]$ abbreviates $\ast[X = (phrase \wedge CAT : s \wedge DTR1 : word \wedge DTR1:CAT : n \wedge DTR1:PHON : Clinton \wedge DTR1:AGR : Y \wedge DTR2 : word \wedge DTR2:CAT : v \wedge DTR2:PHON : talks \wedge DTR2:AGR : Y \wedge DTR1 : Z_1 \wedge DTR2 : Z_2)]$.

First successor branch:

$\ast[\ldots] \,\&\, sign(Z_1)$    $f_7 \times f_3$
  r
$7,\ \ast[\ldots] \,\&\, word(Z_1)$    $f_7 \times f_3$
  c
$\ast[\ldots] \,\&\, word(Z_1)$    $f_3$
  r
$3,\ \ast[\ldots] \,\&\, Z_1 = (word \wedge CAT : n \wedge PHON : Clinton \wedge AGR : sg)$    $f_3$
  c
$\ast[X = (phrase \wedge CAT : s \wedge DTR1 : word \wedge DTR1:CAT : n \wedge DTR1:PHON : Clinton \wedge DTR1:AGR : Y \wedge DTR1:AGR : sg \wedge DTR2 : word \wedge DTR2:CAT : v \wedge DTR2:PHON : talks \wedge DTR2:AGR : Y \wedge DTR2:AGR : sg)]$    $1$

Second successor branch:

$\ast[\ldots] \,\&\, sign(Z_2)$    $f_7 \times f_4$
  r
$7,\ \ast[\ldots] \,\&\, word(Z_2)$    $f_7 \times f_4$
  c
$\ast[\ldots] \,\&\, word(Z_2)$    $f_4$
  r
$4,\ \ast[\ldots] \,\&\, Z_2 = (word \wedge CAT : v \wedge PHON : talks \wedge AGR : sg)$    $f_4$
  c
$\ast[\ldots]$    $1$

Figure 3.6: Quantitative derivation of $[Clinton_N\ talks_V]_S$

Common part of the derivation (nodes from the root downwards, each followed by its value; r and c label the edges):

$X = (sign \wedge DTR1:PHON : Clinton \wedge DTR2:PHON : talks) \,\&\, sign(X)$    $f_6 \times f_2 \times \min\{f_7 \times f_3, f_7 \times f_5\}$
  r
$6,\ X = (sign \wedge DTR1:PHON : Clinton \wedge DTR2:PHON : talks) \,\&\, phrase(X)$    $f_6 \times f_2 \times \min\{f_7 \times f_3, f_7 \times f_5\}$
  c
$X = (sign \wedge DTR1:PHON : Clinton \wedge DTR2:PHON : talks) \,\&\, phrase(X)$    $f_2 \times \min\{f_7 \times f_3, f_7 \times f_5\}$
  r
$2,\ X = (sign \wedge DTR1:PHON : Clinton \wedge DTR2:PHON : talks) \,\&\, X = (phrase \wedge CAT : np \wedge DTR1:CAT : n \wedge DTR2:CAT : n \wedge DTR1 : Z_1 \wedge DTR2 : Z_2) \,\&\, sign(Z_1) \,\&\, sign(Z_2)$    $f_2 \times \min\{f_7 \times f_3, f_7 \times f_5\}$
  c (two successors)

In the following, $\ast[\ldots]$ abbreviates $\ast[X = (phrase \wedge CAT : np \wedge DTR1 : word \wedge DTR1:CAT : n \wedge DTR1:PHON : Clinton \wedge DTR2 : word \wedge DTR2:CAT : n \wedge DTR2:PHON : talks \wedge DTR1 : Z_1 \wedge DTR2 : Z_2)]$.

First successor branch:

$\ast[\ldots] \,\&\, sign(Z_1)$    $f_7 \times f_3$
  r
$7,\ \ast[\ldots] \,\&\, word(Z_1)$    $f_7 \times f_3$
  c
$\ast[\ldots] \,\&\, word(Z_1)$    $f_3$
  r
$3,\ \ast[\ldots] \,\&\, Z_1 = (word \wedge CAT : n \wedge PHON : Clinton)$    $f_3$
  c
$\ast[X = (phrase \wedge CAT : np \wedge DTR1 : word \wedge DTR1:CAT : n \wedge DTR1:PHON : Clinton \wedge DTR1:AGR : sg \wedge DTR2 : word \wedge DTR2:CAT : n \wedge DTR2:PHON : talks)]$    $1$

Second successor branch:

$\ast[\ldots] \,\&\, sign(Z_2)$    $f_7 \times f_5$
  r
$7,\ \ast[\ldots] \,\&\, word(Z_2)$    $f_7 \times f_5$
  c
$\ast[\ldots] \,\&\, word(Z_2)$    $f_5$
  r
$5,\ \ast[\ldots] \,\&\, Z_2 = (word \wedge CAT : n \wedge PHON : talks \wedge AGR : pl)$    $f_5$
  c
$\dagger[X = (phrase \wedge CAT : np \wedge DTR1 : word \wedge DTR1:CAT : n \wedge DTR1:PHON : Clinton \wedge DTR2 : word \wedge DTR2:CAT : n \wedge DTR2:PHON : talks \wedge DTR2:AGR : pl)]$    $1$

Figure 3.7: Quantitative derivation of $[Clinton_N\ talks_N]_{NP}$

$\ast[\ldots] \,\&\, \dagger[\ldots] \xrightarrow{c} X = (phrase \wedge CAT : np \wedge DTR1 : word \wedge DTR1:CAT : n \wedge DTR1:PHON : Clinton \wedge DTR1:AGR : sg \wedge DTR2 : word \wedge DTR2:CAT : n \wedge DTR2:PHON : talks \wedge DTR2:AGR : pl)$

yielding the reading $[Clinton_N\ talks_N]_{NP}$ with weight

$\tau = f_6 \times f_2 \times \min\{f_7 \times f_3, f_7 \times f_5\}$.

Suppose now that we have a subjective weight assignment for the factors of the quantitative CLG of Fig. 3.5 where $f_1 > f_2$ and $f_4 > f_5$. That is, we prefer the rule $S \rightarrow N\ V$ over the rule $NP \rightarrow N\ N$ to describe a phrase. Furthermore, the terminal rule $V \rightarrow talks$, encoding the word talks as a verb, is preferred over the rule $N \rightarrow talks$, encoding it as a noun. Clearly, we get a preference of the answer constraint $\phi$, encoding the reading $[Clinton_N\ talks_V]_S$, over the answer constraint $\psi$, encoding the reading $[Clinton_N\ talks_N]_{NP}$, with $\nu > \tau$.
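
The comparison of the two readings can be checked numerically. The sketch below evaluates the two weight formulas for one hypothetical assignment of factors; the concrete numbers are invented for illustration and only respect the stated preferences $f_1 > f_2$ and $f_4 > f_5$.

```python
# Hypothetical factor values (not from the thesis); they satisfy
# f1 > f2 (S rule preferred) and f4 > f5 (verb reading of "talks" preferred).
f = {1: 0.8, 2: 0.4, 3: 0.9, 4: 0.7, 5: 0.3, 6: 1.0, 7: 1.0}

# Weight of the reading [Clinton_N talks_V]_S (clauses 6, 1, 7, 3, 4).
nu = f[6] * f[1] * min(f[7] * f[3], f[7] * f[4])

# Weight of the reading [Clinton_N talks_N]_NP (clauses 6, 2, 7, 3, 5).
tau = f[6] * f[2] * min(f[7] * f[3], f[7] * f[5])

print(nu, tau, nu > tau)
```

With positive factors the inequality holds for any such assignment, because $f_4 > f_5$ makes the first minimum at least as large as the second and $f_1 > f_2$ then makes the product strictly larger.
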

3.6.2 Alpha-Beta Searching in Quantitative CLGs 



As proposed by van Emden (198(f ), search strategies such as alpha-beta pruning that are 



well known in game theory can be used quite directly to define efficient search strategies 
for quantitative rule sets. The same technique can be applied to the proof procedure of 
quantitative CLP. Alpha-beta pruning is a technique to speed up the search in min/max 
trees without loss of information. For our application, alpha-beta pruning can be used to 
efficiently search a min/max derivation tree for the maximal valued proof tree. The fact that 
no information is lost in alpha-beta pruning means in our context that the maximal valued 
proof tree is guaranteed to be found. Furthermore, in general, the amount of search needed 
to find the best proof for a goal, i.e. the maximal valued proof tree for a goal from a program, 
will be reduced remarkably by controlling the search by the alpha-beta algorithm. 



The central concepts of alpha-beta pruning can be summarized as follows (see Nilsson 

PUD)- 



Usually some form of depth-first search is employed in alpha-beta pruning. The search procedure associates with each max-node (resp. min-node) a dynamic alpha-value (resp. beta-value). These values are based on the static values of terminal nodes and will be backed up in subsequent search by lookahead in the tree.

The search procedure starts with a maximum depth execution of depth-first search, ini- 
tializing the alpha and beta values of the first subtree. During search, alpha and beta values 
are computed as follows: 

• The alpha value of a max-node is the maximum of the current values of its successors. 

• The beta value of a min-node is the minimum of the current values of its successors, 
multiplied by the factor of the clause labeling the min-node. 

The rules for discontinuing the search are as follows:

• Alpha-cutoff: Search can be discontinued below any min-node having a beta value less than or equal to the alpha value of any of its max-node ancestors. The final backed-up value of this min-node can then be set to its beta value.

• Beta-cutoff: Search can be discontinued below any max-node if, for all of its min-node ancestors, the product of its alpha value and the factor of the rule labeling the min-node ancestor is greater than or equal to the beta value of this min-node ancestor. The final backed-up value of this max-node can then be set to its alpha value.

The procedure terminates when all of the successors of the root node have been given a final 
backed-up value. The maximal valued proof tree is then the one taking as single successor of 
each of its max-nodes the successor with the maximal final backed-up value. This proof tree 
is found efficiently if the original min/max tree can be pruned by the alpha-beta procedure 
to a tree consisting of a relatively small number of nodes. 

Let us illustrate these concepts with a simple example. A sample artificial program is given in Fig. 3.8.

1  $p(X) \leftarrow_{.7} r(X) \,\&\, s(X).$

2  $r(X) \leftarrow_{.8} X = a.$

3  $s(X) \leftarrow_{.9} X = a.$

4  $s(X) \leftarrow_{.2} r(X).$

5  $p(X) \leftarrow_{.7} t(X) \,\&\, r(X) \,\&\, s(X).$

6  $t(X) \leftarrow_{.1} X = a.$

Figure 3.8: Quantitative constraint logic program 

The complete min/max derivation tree for the query $p(X) \,\&\, X = a$ to the program of Fig. 3.8 is given in Fig. 3.9.

The concept of alpha-beta pruning can be illustrated with this example as follows (see Fig. 3.10). The alpha value $\alpha = .9$ of the max-node $s(X) \,\&\, X = a$ times the factor .7 of the min-node ancestor is greater than the beta value $\beta = .56$ of this min-node. Since we know that this alpha value cannot be decreased by further evaluation of the subtrees of this max-node, and since we are interested in the minimum of the values of the successors of this min-node, we can cut off the search below this max-node without a risk of losing information relevant to the final maximal valued proof tree. This cutoff is indicated by the dotted line below this max-node in Fig. 3.10. In a similar way, search below the min-node $5,\ t(X) \,\&\, r(X) \,\&\, s(X) \,\&\, X = a$ can be discontinued because the non-decreasing beta value $\beta = .07$ of this node is already smaller than the alpha value $\alpha = .56$ of its max-node ancestor. The pruning of the two subtrees of this
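
The tree is given as an indented outline; each node is followed by its value, and min-nodes carry the number of the clause labeling them.

$p(X) \,\&\, X = a$    $\max(.56, .07) = .56$
    $1,\ r(X) \,\&\, s(X) \,\&\, X = a$    $.7 \times \min(.8, .9) = .56$
        $r(X) \,\&\, X = a$    $.8$
            $2,\ X = a \,\&\, X = a$    $.8$
                $X = a$    $1$
        $s(X) \,\&\, X = a$    $\max(.9, .16) = .9$
            $3,\ X = a \,\&\, X = a$    $.9$
                $X = a$    $1$
            $4,\ r(X) \,\&\, X = a$    $.2 \times .8$
                $r(X) \,\&\, X = a$    $.8$
                    $2,\ X = a \,\&\, X = a$    $.8$
                        $X = a$    $1$
    $5,\ t(X) \,\&\, r(X) \,\&\, s(X) \,\&\, X = a$    $.7 \times \min(.1, .8, .9) = .07$
        $t(X) \,\&\, X = a$    $.1$
            $6,\ X = a \,\&\, X = a$    $.1$
                $X = a$    $1$
        $r(X) \,\&\, X = a$    $.8$
            $2,\ X = a \,\&\, X = a$    $.8$
                $X = a$    $1$
        $s(X) \,\&\, X = a$    $\max(.9, .16) = .9$
            $3,\ X = a \,\&\, X = a$    $.9$
                $X = a$    $1$
            $4,\ r(X) \,\&\, X = a$    $.2 \times .8$
                $r(X) \,\&\, X = a$    $.8$
                    $2,\ X = a \,\&\, X = a$    $.8$
                        $X = a$    $1$

Figure 3.9: Complete search of a quantitative derivation tree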



min-node again is indicated by dotted lines in Fig. 3.10. Again, there is no risk of information loss in this pruning step.

Clearly, in each application of the alpha-beta procedure, the number of nodes to be generated and evaluated is minimal when the number of cutoffs is maximal. The best case occurs when the maximal valued proof tree is reached first in the depth-first search. In the worst case, no gain in search efficiency is obtained at all, i.e., all nodes of the min/max tree have to be generated. In either case, the maximal valued proof tree is guaranteed to be left unpruned. At the risk of losing relevant information, the alpha-beta procedure can be made more aggressive by setting an initial alpha value for the root node, which allows cutting off search branches with a root value lower than this initial value. For a thorough analysis of the properties of alpha-beta pruning the reader is referred to Knuth and Moore (1975).
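
The pruning behaviour described for this example can be reproduced with a short program. The following is a minimal sketch under stated assumptions, not the procedure as implemented in the thesis: the program of Fig. 3.8 is encoded in a ground form (every constraint is $X = a$ and is satisfied), all factors are assumed to lie in (0, 1], and the scaling of the search window by the clause factor is one way of realising the two cutoff rules. The names `PROGRAM`, `full_max`, `ab_max` and `ab_min` are hypothetical.

```python
import math

# Ground rendering of Fig. 3.8: predicate -> list of
# (clause number, factor, body atoms); an empty body is a success node.
PROGRAM = {
    "p": [(1, 0.7, ["r", "s"]), (5, 0.7, ["t", "r", "s"])],
    "r": [(2, 0.8, [])],
    "s": [(3, 0.9, []), (4, 0.2, ["r"])],
    "t": [(6, 0.1, [])],
}


def full_max(pred, visited):
    """Exhaustive evaluation of a max-node; records every min-node visited."""
    best = 0.0
    for number, factor, body in PROGRAM[pred]:
        visited.append(number)
        m = min([full_max(atom, visited) for atom in body], default=1.0)
        best = max(best, factor * m)
    return best


def ab_max(pred, alpha, beta, visited):
    """Max-node with window (alpha, beta)."""
    best = 0.0
    for clause in PROGRAM[pred]:
        best = max(best, ab_min(clause, max(alpha, best), beta, visited))
        if best >= beta:      # beta-cutoff: remaining clauses cannot matter
            break
    return best


def ab_min(clause, alpha, beta, visited):
    """Min-node with value factor * min(successors)."""
    number, factor, body = clause
    visited.append(number)
    m = math.inf if body else 1.0
    for atom in body:
        # A successor value c only matters if factor * c lies in the window.
        m = min(m, ab_max(atom, alpha / factor, min(beta / factor, m), visited))
        if factor * m <= alpha:   # alpha-cutoff: node cannot beat alpha
            break
    return factor * m


seen_full, seen_ab = [], []
value_full = full_max("p", seen_full)
value_ab = ab_max("p", -math.inf, math.inf, seen_ab)
print(value_full, len(seen_full))
print(value_ab, seen_ab)
```

On this program the exhaustive search visits all eleven min-nodes of Fig. 3.9, whereas the pruned search visits only the min-nodes labeled by clauses 1, 2, 3, 5 and 6 and returns the same root value: the subtree below clause 4 is removed by a beta-cutoff and the successors r and s of clause 5 by an alpha-cutoff.
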

Figure 3.10: Alpha-beta search of quantitative derivation tree

Furthermore, it should be noted that a strict application of alpha-beta pruning is possible only for quantitative CLP based on min/max trees. Suppose for example that the minimum operator is replaced by a product operator throughout the declarative as well as operational semantics of quantitative CLP. This replacement could be motivated by the aim to consider the contribution of all antecedent atoms, instead of only one, to the weight of the consequent. To efficiently search for the maximal valued proof tree in such a setting, a version of alpha-beta pruning employing only alpha-cutoffs has to be used. In this setting, additional beta-cutoffs can improve the search efficiency for finding a good proof tree, but possibly cut off parts of the best proof tree, i.e., here attention has to be paid to the risk of losing information relevant to the maximal valued proof tree.
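
A product-based variant with alpha-cutoffs only can be sketched as follows. This is an assumption-laden illustration rather than a procedure given in the thesis: the program of Fig. 3.8 is reused in a ground form, all factors are assumed to lie in (0, 1] so that a partial product can only shrink as further atoms are multiplied in, and the names `PROGRAM`, `prod_max` and `prod_node` are hypothetical.

```python
# Ground rendering of Fig. 3.8: predicate -> list of
# (clause number, factor, body atoms); an empty body is a success node.
PROGRAM = {
    "p": [(1, 0.7, ["r", "s"]), (5, 0.7, ["t", "r", "s"])],
    "r": [(2, 0.8, [])],
    "s": [(3, 0.9, []), (4, 0.2, ["r"])],
    "t": [(6, 0.1, [])],
}


def prod_max(pred, alpha, visited):
    """Max-node: best clause value; no beta-cutoff is applied here."""
    best = 0.0
    for clause in PROGRAM[pred]:
        best = max(best, prod_node(clause, max(alpha, best), visited))
    return best


def prod_node(clause, alpha, visited):
    """Product node: factor times the product of the successor values."""
    number, factor, body = clause
    visited.append(number)
    value = factor
    for atom in body:
        if value <= alpha:
            # Alpha-cutoff: the partial product is an upper bound on the
            # final value, so this clause cannot beat alpha any more.
            break
        # The successor must exceed alpha / value for the node to beat alpha.
        value *= prod_max(atom, alpha / value, visited)
    return value


seen = []
best_value = prod_max("p", 0.0, seen)
print(best_value, seen)
```

The cutoff is safe only because every further factor is at most 1. A beta-cutoff has no such justification in this setting, since raising the value of one successor always raises the product.
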



3.7 Summary and Discussion 

In this chapter we presented a formal framework for quantitative CLP. In this framework CLP clauses were annotated with arbitrary numerical weights. Such weighted clauses were interpreted in a model theory based on concepts of fuzzy set algebra. The quantitative system was shown to be sound and complete with respect to a fixpoint semantics based on minimal models in this model theory. We illustrated the concepts of quantitative CLP by a simple quantitative feature-based CLG. Furthermore, we showed how to adapt the search algorithm of alpha-beta pruning to searching efficiently for the highest weighted proof tree in a quantitative CLP system.

The advantage of quantitative CLP clearly is the freedom it offers to the grammar writer or implementer to specify arbitrary weights in a formally clear and efficient programming framework. Such weights could be specified, e.g., as subjective preference values (Erbach 1993b, Kim 1994, Douglas and Dale 1992), subjective values expressing graded grammaticality (Erbach 1993a), or subjective probabilities (Erbach 1998). Calculation with arbitrary weights of this kind can be interpreted in a unique well-defined formal framework. Furthermore, generalizations of the formal system to particular applications which require particular calculation schemes are easily obtainable. For example, if we want to model probabilistic context-free grammars (Booth and Thompson 1973) in quantitative CLP, we simply must attach subjective probability measures to a context-free program according to the conditions of Booth and Thompson (1973), and replace the minimum operator by a product operator in all relevant definitions of the declarative and operational semantics of quantitative CLP. Unfortunately, such changes in the weight calculation model prevent a direct application of alpha-beta pruning for efficient disambiguation. For example, in the case of probabilistic context-free CLP, only a restricted version of alpha-beta pruning using exclusively alpha-cutoffs is applicable. Alternatively, one could use a form of best-first pruning for the search task, i.e., a search rule which selects the highest weighted clause at each derivation step. Clearly, this approach does not guarantee that the highest weighted proof tree is found, but offers only an approximate heuristic search procedure.
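
That best-first selection is only a heuristic can be seen on a very small example. The sketch below uses an invented three-predicate program (not from the thesis) in which the clause with the highest factor leads to a poorly weighted proof; the min-based weight calculation is used, and the names `PROGRAM`, `best_first` and `exhaustive` are hypothetical.

```python
# Hypothetical ground program: predicate -> list of (factor, body atoms).
#   a <-_.9 b.    a <-_.8 c.    b <-_.1 true.    c <-_.9 true.
PROGRAM = {
    "a": [(0.9, ["b"]), (0.8, ["c"])],
    "b": [(0.1, [])],
    "c": [(0.9, [])],
}


def best_first(pred):
    """Greedy search: always commit to the clause with the highest factor."""
    factor, body = max(PROGRAM[pred], key=lambda clause: clause[0])
    return factor * min([best_first(atom) for atom in body], default=1.0)


def exhaustive(pred):
    """Value of the maximal valued proof tree, found by complete search."""
    return max(
        (factor * min([exhaustive(atom) for atom in body], default=1.0)
         for factor, body in PROGRAM[pred]),
        default=0.0,
    )


greedy_value = best_first("a")
optimal_value = exhaustive("a")
print(greedy_value, optimal_value)
```

The greedy rule commits to the first clause for a because of its factor .9 and ends with a weight of about .09, while the proof through the second clause has a weight of about .72.
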

However, regardless of the specific choice of weights, a proper specification of a multitude of 
weights can be very complex and is always user-dependent. In several applications, one would 
like to trade in the flexibility of subjective weight assignment for automatic and reusable 
methods for estimating weights from empirical data. One solution to this problem is to use 
automatic methods for statistical inference to induce values of probabilistic parameters from 
empirical data. 

In the next chapter, we present a framework of probabilistic CLP which addresses the problem of finding a proper probability distribution over the set of proof trees of a constraint logic program and of using statistical estimation methods to infer parameters from empirical data. Clearly, even if it were possible to specify a model-theoretic semantics for such a system, it is superfluous to do so in the context of automatic statistical inference. Rather, the interest here is in the stochastic semantics of CLP provided by the probabilistic and statistical methods used.



Chapter 4 

Probabilistic CLP: Probabilistic 
Modeling and Statistical Inference 
from Incomplete Data 



In this chapter we present a probabilistic model for CLP and a novel method for statistical
inference of the parameter values of such a model from incomplete training data. We
show monotonicity and convergence of the new algorithm to the desired maximum likelihood
estimates. Furthermore, we show the usefulness of the statistical approach by a small-scale
experiment on estimating feature-based CLGs. We present a novel algorithm to infer the
properties of such parametric probability models from incomplete data and discuss different
approaches for approximate computation for the inference task. Moreover, we discuss the
possibilities of using the structure of the probabilistic model to guide the search in finding
the most probable proof tree in probabilistic CLP and present a heuristic search method for
this task.

This chapter is based upon work previously published in Riezler (1997), Riezler (1998a),
Riezler (1998b), and Johnson, Geman, Canon, Chi, and Riezler (1999).



4.1 Introduction and Overview 



In the previous chapter we presented a formal semantics for a system of quantitative CLP. This 
formal semantics and the connected quantitative inference system were crucially based upon 
open parameters for subjective weights. Most approaches to probabilistic logic programming 
interpret such weights as subjective probabilities, and concentrate on inference systems and 
formal semantics for programming systems with user-defined probabilities attached to the 
formulae of the language. The aim of such approaches is the development of sound and
complete logic programming systems where the handling of weights is restricted to accord with
the laws of probability theory. That is, these approaches aim to connect logical inference with 
probabilistic inference. 

In this chapter, we present a completely different approach to probabilistic CLP. In this 
approach, subjective assignment of probabilities is replaced by automatic and reusable meth- 
ods for estimating empirical probabilities from data. The central aims of this approach are 
the specification of a probability distribution over the proof trees given by a program, and the 
use of statistical methods to infer the values of the probabilistic parameters from empirical 
data. That is, in this setting, the weight of a CLP proof tree is determined directly by a
probability distribution over proof trees rather than by a quantitative calculation scheme
referring to weighted clauses. The parameters of the probability distribution are determined by
statistical inference from empirical data rather than by an assignment of subjective weights 
to clauses. Furthermore, the specific properties of the parametric probability model can be 
inferred by statistical methods. That means, in this chapter we do not only turn from quan- 
titative to probabilistic inference but, what is more, to statistical inference. In such a setting, 
the connection of probability theory, semantic fixpoint theory and logical inference theory is 
not of interest since the specification of probabilistic parameters is done by automatic statis- 
tical methods and not manipulable by the user. Rather, we are interested in the stochastic 
semantics defined by the methods of probabilistic modeling and statistical inference. 

The statistical problem we consider here is the problem of statistical parameter estimation.
We assume that the statistical properties of a given sample of observations $O = O_1, \ldots, O_n$
can be described by a parametric family of probability distributions. That is, the probability
distribution that generated the data is assumed to be completely known except for the values
of a vector $\theta$ of parameters. We then ask how the unknown value of $\theta$ can be estimated
from the observation sequence $O$, i.e., a statistical inference is made about the values of
the parameters defining that family. Recent interest in statistical approaches to NLP can
be attributed to the fact that solutions to such statistical problems can lead quite directly
to effective, but conceptually simple and mathematically clear solutions to various problems
in NLP. In the context of structural ambiguity resolution in NLP systems, this connection
is as follows: Given a probabilistic grammar depending on parameter vector $\theta$ and given a
training corpus $O$, a solution $\hat\theta$ to the parameter estimation problem will adapt the model
parameters to best account for the input corpus. This tuning of the grammar to a particular
natural language corpus is a necessary prerequisite for probabilistic disambiguation. That is,
when the plausibility of a parse is connected with its probability, the assumption that the
correct parse of a sentence is its most probable parse can be made with some justification if
the underlying probabilistic grammar is based on parameter values $\hat\theta$ estimated from large
data sets of natural language.

The aim of this chapter is to solve open problems in statistical inference and probabilistic
modeling of constraint-based grammars. Following Abney (1996), we choose the parametric
family of log-linear probability distributions to model such grammars. The great advantage
of log-linear models is their generality and flexibility. Log-linear models allow us to describe
arbitrary context dependencies in the data by choosing a few salient properties of the data as
the defining properties of the model. In contrast to most approaches to probabilistic grammars,
with log-linear models we are not restricted to building our models on production rules or other
configurational properties of the data. Rather, we are free to employ essentially
arbitrary properties in our models. For example, heuristics on preferences of grammatical
functions or on attachment preferences as used in Srinivas, Doran, and Kulick (1995), or the
preferences in lexical relations as used in Alshawi and Carter (1994) can be integrated into a
log-linear model very easily. However, the step from simple rule-based probability models to
general log-linear models also requires a more general and more complex estimation algorithm.
The estimation algorithm for log-linear models used by Abney (1996) is the iterative scaling
method of Della Pietra, Della Pietra, and Lafferty (1997). This algorithm allows us to recast the
optimization of weights of preference functions as done by Srinivas, Doran, and Kulick (1995)
or Alshawi and Carter (1994) as estimation of parameters associated with the properties of
a log-linear model. However, there is a drawback: In contrast to rule-based models where
efficient estimation algorithms from incomplete, i.e., unannotated data exist, the iterative
scaling estimation method of Della Pietra, Della Pietra, and Lafferty (1997) applies only to
complete, i.e., fully annotated training data. Unfortunately, the need to rely on large samples
of complete data is impractical. For parsing applications, complete data means several
person-years of hand-annotating large corpora with specialized grammatical analyses. This task is
always labor-intensive, error-prone, and restricted to a specific grammar framework, a specific
language, and a specific language domain.
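To make the notion of a log-linear model over arbitrary properties concrete, the following is a minimal sketch, not the model defined later in this chapter. It assumes a finite, explicitly enumerated set of analyses and the generic form p(x) = exp(sum_i w_i f_i(x)) / Z; the toy analyses, the two property functions, and the weights are hypothetical and serve only as illustration.

```python
import math

def log_linear_distribution(analyses, properties, weights):
    """Return p(x) = exp(sum_i w_i * f_i(x)) / Z over a finite set of analyses."""
    scores = [math.exp(sum(w * f(x) for w, f in zip(weights, properties)))
              for x in analyses]
    z = sum(scores)  # normalizing constant Z
    return {x: s / z for x, s in zip(analyses, scores)}

# Hypothetical toy analyses: (attachment decision, grammatical function).
parses = [("high_attach", "subj"), ("low_attach", "subj"), ("low_attach", "obj")]

# Properties need not correspond to grammar rules; any function of x will do.
properties = [
    lambda x: 1.0 if x[0] == "low_attach" else 0.0,  # attachment preference
    lambda x: 1.0 if x[1] == "subj" else 0.0,        # grammatical function
]
weights = [0.7, -0.2]  # assumed values; in the chapter these would be estimated

dist = log_linear_distribution(parses, properties, weights)
best = max(dist, key=dist.get)
```

Under these assumed weights, the analysis with low attachment and object function receives the highest probability, since it is the only one that collects the positive weight without the negative one.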

Thus, the first open problem to solve is to find automatic and reusable techniques for
parameter estimation of probabilistic constraint-based grammars from incomplete data. We
will present a general estimation algorithm for log-linear models from incomplete data which
can be seen as an extension of the iterative scaling method of Della Pietra, Della Pietra,
and Lafferty (1997). We prove monotonicity and convergence of the new algorithm to (local)
maxima of the incomplete-data log-likelihood function, and show how automatic property
selection can be done from incomplete data.

A further open problem is the empirical evaluation of the performance of probabilistic 
constraint-based grammars in terms of finding human-determined correct parses. We present 
an experiment with a log-linear model employing a few hundred general properties encoding 
grammatical functions, attachment preferences, branching behaviour, parallelism, and other 
general properties of constraint-based parses. The experiment was conducted on a small scale 
but clearly shows the usefulness of general properties in order to get good results in a linguistic 
evaluation.

Clearly, for larger scales, problems arise concerning the tractability of the estimation
formulae. We discuss the applicability of several approximation methods to our problem of
statistical inference from incomplete data, including Newton's method, Monte Carlo methods,
and methods for approximating expectations via pseudo-likelihood approaches.

A further open problem is the efficient search for most probable parses, i.e., best-parse
search, in parsing systems based on probabilistic constraint-based grammars. Instead of listing
all possible parses and selecting the most probable one, one would like to use the structure
of the probabilistic model to guide the search for the most probable analysis. Most popular
approaches use the search technique of the Viterbi algorithm (Viterbi (1967), Forney (1973)) to
solve this problem, but there is as yet no solution for probabilistic constraint-based grammars.
We show that standard methods for best-parse search are only of limited use for probabilistic
models involving context-dependencies, and make the move to approximate heuristic methods.

To summarize, our approach satisfies the following requirements. It 

• is generally applicable to probability models involving context-dependencies, and espe- 
cially to a probabilistic model for CLP over arbitrary constraint languages, 

• provides automatic and reusable techniques for statistical inference from incomplete 
data for such probability models, and 

• is accompanied with search techniques for finding most probable analyses in probabilistic 
CLP. 



This chapter is organized as follows. In Sect. 4.2 we discuss related previous approaches
to statistical inference for probabilistic constraint-based grammars.

In Sect. 4.3 we introduce the basic concepts of maximum likelihood estimation from
incomplete data via the EM algorithm.

Sect. 4.4 discusses the problem of applying a popular instance of this algorithm, namely
Baum's maximization technique for stochastic context-free models, to parameter estimation
for probabilistic CLP.

Sects. 4.5 and 4.6 present in detail a solution to this problem by introducing a log-linear
probability model for CLP coupled with an incomplete-data inference algorithm for such
models. These sections include a detailed proof of monotonicity and convergence of the inference
algorithm.

Sect. 4.7 presents an empirical evaluation of the applicability of general log-linear
distributions to probabilistic constraint-based grammars in a small-scale experiment on estimating
a log-linear model on constraint-based parses.

Sect. 4.8 discusses computation issues such as the use of Monte Carlo methods, Newton's
numerical method, and other approximation techniques in the context of this inference
process.

Sect. 4.9 discusses the applicability of standard parsing and search methods to
context-dependent constraint-based models, and presents a heuristic method for searching for best
parses in CLGs.



4.2 Previous Work 

An approach to defining estimators for probabilistic constraint-based grammars which has been
applied to nearly all constraint-based formalisms is a renormalized extension of the estimator
for stochastic regular (Baum, Petrie, Soules, and Weiss 1970) or context-free grammars
(Baker 1979) to constraint-based models. Examples of this approach are stochastic
unification-based grammars (Briscoe and Waegner 1992; Briscoe and Carroll 1993), stochastic
constraint logic programming (Eisele 1994), stochastic head-driven phrase structure grammar
(Brew 1995), stochastic logic programming (Miyata 1996), stochastic categorial grammars
(Osborne and Briscoe 1997), and data-oriented approaches to lexical-functional grammar (Bod
and Kaplan 1998). Since the estimation technique for context-free models is based on the
assumption of mutual independence of the model's derivation steps, but context-dependent
constraints on derivations are inherent to constraint-based grammars, these approaches lose
probability mass to failure derivations. However, the necessary renormalization
of the probability distribution on derivations with respect to consistent derivations
causes a general deviation of the resulting estimates from the desired maximum likelihood
estimates. This was first shown by Abney (1996) for estimation of constraint-based models
from complete data. We will make a similar argument for incomplete data in the following.
From an optimization-theoretic point of view, these approaches can be described as maximization
procedures for pseudo-likelihood functions for context-free models where the probability
distribution on context-free derivations is restricted to consistent derivations in the
constraint-based sense. Maximum pseudo-likelihood estimators for context-free models certainly
are sensible, e.g., if the aim is to constrain an inherently context-free language to include only
linguistically plausible derivations, as is done by introducing bracketing constraints on
context-free derivations by Pereira and Schabes (1992). However, it is questionable whether it is
the best way to model constraint-based grammars probabilistically by context-free models which
respect constraints only indirectly to discard derivations. The move to log-linear models as is
done in our approach clearly has several advantages. Since there is linguistically no reason to
base probabilistic grammars on rule-properties, we can now exploit the flexibility of log-linear
distributions and model the context-dependencies in the data directly. Furthermore, since the new
family of parametric probability models requires new estimation techniques, we can again take
consistent maximum likelihood estimators as the optimization procedures of our choice.

Other approaches to probabilistic constraint-based models have been presented which define
custom-built statistical inference procedures for specialized parsing models including a
limited amount of context-dependency. For example, the model presented by Goodman (1998)
conditions on a finite set of categorial features beyond the nonterminal of each node, which
makes it possible to explicitly unfold the dependencies in the parsing model. This allows for
the use of standard dynamic programming techniques for computation. In the approaches
of Magerman (1994) and Ratnaparkhi (1998), general statistical inference methods, namely
decision trees and maximum-entropy methods, are used to infer weights associated with the
actions of specialized parsing models including limited context-dependency. However, it is
difficult to generalize these models to arbitrary log-linear models on constraint-based grammars,
concerning both the choice of properties and the issue of efficient computation. Clearly,
a careful choice of properties and dependencies makes it possible to tune specialized models
to maximum accuracy and efficiency, which does not hold for the general case.¹ The aim of
our approach is to address problems concerning estimation, property design, or approximation
methods for general log-linear models and show these general ideas to be applicable in
practice.

4.3 Maximum Likelihood Estimation from Incomplete Data 
via the EM Algorithm 

A constant companion during the course of this chapter will be the statistical estimation
technique of the Expectation-Maximization (EM) algorithm. The fact that both Baum's estimation
technique, which is shown not to be applicable to probabilistic CLP in Sect. 4.4.2, and
the incomplete-data estimation algorithm for log-linear models we present in Sects. 4.5-4.8
can be seen as instances of the EM algorithm justifies a closer look at this estimation scheme.



4.3.1 General Theory of the EM Algorithm 



The EM algorithm was introduced by Dempster, Laird, and Rubin (1977), although
central parts of the general theory can be found earlier in special applications, e.g., in Baum,
Petrie, Soules, and Weiss (1970). Various applications and extensions of the algorithm are
discussed in Little and Rubin (1987) and, more recently, in McLachlan and Krishnan (1997).

The EM method is a technique for maximum likelihood estimation (MLE) from incomplete
data. For a parametric family of probability distributions depending on parameter vector $\theta$
and a given sample of training data from this parametric family, MLE defines the estimate $\hat\theta$
of $\theta$ as a value of $\theta$ which maximizes the likelihood of the training sample. MLE from observed
complete data is particularly easy for many statistical problems, thanks to the nice form of the
complete-data (log-)likelihood function. The problem the EM algorithm especially addresses
is the case where the observed data are incomplete. That is, we observe only a function of
complete data, which themselves are unobserved. Because of this indirect, hidden character
of the complete data, MLE from incomplete data is difficult.

¹ For example, as noted by Goodman (1998), the computational complexity of his dynamic programming
algorithm for probabilistic feature-grammars is exponential in the general case.

In the following, an incomplete-data estimation setting is assumed to consist of 



• a sample space $\mathcal{Y}$ of observed, incomplete data,

• a sample space $\mathcal{X}$ of unobserved, complete data,

• a many-to-one function $Y : \mathcal{X} \to \mathcal{Y}$ s.t. $Y(x) = y$ is the unique observation
corresponding to the complete datum $x$, and its inverse $X : \mathcal{Y} \to 2^{\mathcal{X}}$ s.t. $X(y) = \{x \mid Y(x) = y\}$ is
the countably infinite set of complete data corresponding to the observation $y$,

• a complete-data specification $p_\theta(x)$ with parameters $\theta \in \Theta$,

• an incomplete-data specification $g_\theta(y)$ which is related to the complete-data specification
by marginalization as

$$g_\theta(y) = \sum_{x \in X(y)} p_\theta(x).$$



Let $y_1, y_2, \ldots, y_N$ be a random sample from $\mathcal{Y}$, i.e., values of independently and identically
distributed (i.i.d.) random variables on $\mathcal{Y}$. Let $p[f] = \sum_{\omega \in \Omega} p(\omega) f(\omega)$ denote the expectation
of a function $f : \Omega \to \mathbb{R}$ with respect to a probability distribution $p$ on $\Omega$. If $f$ is a multivariate
function $f(\omega', \omega)$, then the expectation of $f$ with respect to $p(\omega)$ is written $p[f(\omega', \cdot)]$.
Furthermore, let the empirical probability $\tilde p(y)$ of an incomplete data type be defined as
$\tilde p : \mathcal{Y} \to \mathbb{R}$ s.t. $\tilde p(y) = N^{-1} \sum_{i=1}^{N} \delta_{y_i,y}$, where the Kronecker delta is

$$\delta_{y_i,y} = \begin{cases} 1 & \text{if } y_i = y, \\ 0 & \text{otherwise.} \end{cases}$$

Then the incomplete-data log-likelihood $L$ is defined for a random sample from $\mathcal{Y}$ as a function
of $\theta$ as

$$L(\theta) = \ln \prod_{y \in \mathcal{Y}} g_\theta(y)^{\tilde p(y)} = \sum_{y \in \mathcal{Y}} \tilde p(y) \ln g_\theta(y) = \tilde p[\ln g_\theta].$$

The EM algorithm is directed at finding a value $\hat\theta$ of $\theta \in \Theta$ that maximizes $L$ as a function
of $\theta$ for a given random sample from $\mathcal{Y}$, i.e.,

$$\hat\theta = \arg\max_{\theta \in \Theta} L(\theta) \quad \text{where} \quad L(\theta) = \tilde p[\ln g_\theta] = \tilde p\Big[\ln \sum_{x \in X(\cdot)} p_\theta(x)\Big].$$

The summation inside this logarithm can make MLE from incomplete data difficult even when
complete-data MLE is easy.
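As an illustration of the definition of the incomplete-data log-likelihood, the following minimal sketch computes it for a finite toy setting. It assumes that the set of complete data for each observation can be enumerated explicitly; the function name, the table `p_table`, and the sample are hypothetical and not part of the formal development.

```python
import math
from collections import Counter

def incomplete_data_log_likelihood(sample, analyses_of, p):
    """L(theta) = sum_y ptilde(y) * ln sum_{x in X(y)} p_theta(x)."""
    n = len(sample)
    # Empirical probability ptilde(y): relative frequency of y in the sample.
    empirical = {y: c / n for y, c in Counter(sample).items()}
    return sum(pt * math.log(sum(p(x) for x in analyses_of(y)))
               for y, pt in empirical.items())

# Hypothetical complete data x = (y, z): observation y plus hidden part z.
p_table = {("a", 0): 0.1, ("a", 1): 0.3, ("b", 0): 0.4, ("b", 1): 0.2}

def analyses_of(y):
    """X(y): all complete data that map to the observation y."""
    return [x for x in p_table if x[0] == y]

sample = ["a", "b", "b", "b"]
L = incomplete_data_log_likelihood(sample, analyses_of, p_table.get)
```

The inner sum over `analyses_of(y)` is the marginalization that sits inside the logarithm; it is this term that prevents the log-likelihood from decomposing as nicely as in the complete-data case.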

The old idea formalized in the EM algorithm can be stated informally as follows: 1. Replace 
unobserved data values by expected values, 2. perform MLE from the expected complete 
data, 3. recompute the unobserved data expectations using the new parameter estimates, 4. 
reestimate parameters using the new expectations, 5. iterate until convergence of the likelihood 
function. 

The trick of the EM algorithm thus is to solve the incomplete-data estimation problem for
$\ln g_\theta(y)$ indirectly by proceeding iteratively in terms of complete-data estimation for $\ln p_\theta(x)$.
Since the $x$ are not observable, $\ln p_\theta(x)$ is replaced by its conditional expectation given the
observed data $y$ and the current fit of the parameter values $\theta^{(t)}$. That is, complete-data
log-likelihood values are constructed from a conditional expectation given the observed data of
the incomplete-data problem and the current value of the unknown parameters (E-step). From
the complete data manufactured in this way, maximization is simpler and often exists in closed form
(M-step). Starting from suitable initial parameter values, the E- and M-steps are iterated until
convergence of the incomplete-data log-likelihood $L$.
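The alternation of E- and M-steps can be sketched on a standard textbook problem, which is used here only as an assumed illustration and is not one of the models of this thesis: each observation is the number of heads in m tosses of one of two coins, and the identity of the coin is the unobserved part of the complete data. All names and data values below are hypothetical.

```python
import math

def binom(m, h, p):
    """Probability of h heads in m tosses of a coin with head probability p."""
    return math.comb(m, h) * p**h * (1 - p)**(m - h)

def log_likelihood(heads, m, w, p):
    """Incomplete-data log-likelihood, averaged over the sample."""
    return sum(math.log(sum(w[z] * binom(m, h, p[z]) for z in (0, 1)))
               for h in heads) / len(heads)

def em_step(heads, m, w, p):
    """One EM iteration for the two-coin mixture with hidden coin identity z."""
    resp_sum = [0.0, 0.0]   # expected number of observations per coin
    head_sum = [0.0, 0.0]   # expected number of heads per coin
    for h in heads:
        joint = [w[z] * binom(m, h, p[z]) for z in (0, 1)]
        g = sum(joint)                  # marginal probability of the observation
        for z in (0, 1):
            k = joint[z] / g            # E-step: conditional probability of z
            resp_sum[z] += k
            head_sum[z] += k * h
    # M-step: closed-form MLE from the expected complete-data counts.
    new_w = [r / len(heads) for r in resp_sum]
    new_p = [head_sum[z] / (m * resp_sum[z]) for z in (0, 1)]
    return new_w, new_p

heads = [9, 8, 2, 1, 7, 3, 9, 2]        # heads out of m tosses per observation
m = 10
w, p = [0.5, 0.5], [0.6, 0.4]           # assumed initial parameter values
trace = [log_likelihood(heads, m, w, p)]
for _ in range(25):
    w, p = em_step(heads, m, w, p)
    trace.append(log_likelihood(heads, m, w, p))
```

The list `trace` records the incomplete-data log-likelihood after each iteration and can be inspected to observe that the values never decrease on this example.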

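More formally, let $k_\theta(x|y) = p_\theta(x)/g_\theta(y)$ be the conditional probability of $x$ given $y$ and
$\theta$, then

$$\begin{aligned}
L(\theta') &= \sum_{y \in \mathcal{Y}} \tilde p(y) \ln g_{\theta'}(y) \\
&= \sum_{y \in \mathcal{Y}} \tilde p(y)\, k_\theta[\ln g_{\theta'}(y)] \\
&= \sum_{y \in \mathcal{Y}} \tilde p(y) \sum_{x \in X(y)} k_\theta(x|y) \ln \frac{p_{\theta'}(x)}{k_{\theta'}(x|y)} \\
&= \sum_{y \in \mathcal{Y}} \tilde p(y) \sum_{x \in X(y)} k_\theta(x|y) \ln p_{\theta'}(x) - \sum_{y \in \mathcal{Y}} \tilde p(y) \sum_{x \in X(y)} k_\theta(x|y) \ln k_{\theta'}(x|y) \\
&= \tilde p[k_\theta[\ln p_{\theta'}]] - \tilde p[k_\theta[\ln k_{\theta'}]] \\
&= Q(\theta';\theta) - H(\theta';\theta).
\end{aligned}$$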

$Q(\theta';\theta)$, the conditional expectation of the complete-data log-likelihood function $\ln p_{\theta'}(x)$
given $y$ and $\theta$, is then used as an auxiliary function to construct an EM algorithm via a
mapping $M : \Theta \to \Theta$, where each iteration is defined by $\theta^{(t+1)} = M(\theta^{(t)})$ as follows:

E-step: Compute $Q(\theta;\theta^{(t)}) = \tilde p[k_{\theta^{(t)}}[\ln p_\theta]]$.

M-step: Choose $\theta^{(t+1)}$ to be a value of $\theta \in \Theta$ which maximizes $Q(\theta;\theta^{(t)})$.



That is, $M$ is a point-to-set map $M(\theta^{(t)}) = \arg\max_{\theta \in \Theta} Q(\theta;\theta^{(t)})$. This use of $Q$ as an auxiliary
function in the EM algorithm can be justified by the fact that an iterative maximization of
$Q$ guarantees that the incomplete-data log-likelihood function $L$ is non-decreasing on each
iteration of an EM algorithm. This can easily be shown with the inequality

$$L(M(\theta)) - L(\theta) = \big(Q(M(\theta);\theta) - Q(\theta;\theta)\big) + \big(H(\theta;\theta) - H(M(\theta);\theta)\big) \ge 0 \quad \text{for all } \theta \in \Theta,$$

which follows from the non-negativity of the difference both in the $Q$ functions (by definition of
$M$) and in the $H$ functions (by Jensen's inequality (see Cover and Thomas (1991))). That is,
we have the following proposition, due to Dempster, Laird, and Rubin (1977).

Proposition 4.1 (Dempster et al. (1977), Theorem 1).
For each EM algorithm, $L(M(\theta)) \ge L(\theta)$ for all $\theta \in \Theta$.

Although $Q$ is globally maximized in each M-step, the term $H$ may hinder a straight
global maximization of $L$. As a general result for EM algorithms, Wu (1983) shows that under
continuity and differentiability conditions on $L$ and $Q$, the sequence of likelihood values
$\{L(\theta^{(t)})\}$ of EM iterates, if bounded from above, converges monotonically to the value of $L$
at a critical point.

Proposition 4.2 (Wu (1983), Theorem 2). For continuous $Q$, all limit points of any
instance of an EM algorithm are critical points of $L$, and for continuous and differentiable
$L$, a sequence $\{L(\theta^{(t)})\}$ bounded from above converges monotonically to $L^* = L(\theta^*)$ for some
critical point $\theta^*$ of $L$.

To summarize, the popularity of the EM algorithm is due to its easy computation because 
it relies only on complete-data computations: the E-step involves complete-data conditional 
expectations, and the M-step requires MLE from these completed data. Even if the algorithm 
may converge slowly, it conservatively increases the likelihood function at each iteration and 
in almost all cases converges to a local maximum of L. If a sequence of EM iterates is stuck 
at some critical point which is not a local or global maximum of L, e.g., a saddle point or 
even a local minimum, a small random perturbation will help it to diverge from this critical 
point. If L has several critical points, the convergence properties of an EM sequence will be 
extremely dependent on the choice of the starting value of the sequence of iterates. 

4.3.2 Partial M-Steps: The GEM Algorithm 

As discussed in the last section, one main feature of the EM algorithm is to provide a simplified
M-step where MLE from complete data rather than from incomplete data is performed. In
some cases, even this maximization is complicated and does not exist in closed form. An EM
algorithm involving such a complicated M-step would be computationally unattractive. For
such cases, Dempster, Laird, and Rubin (1977) defined a so-called generalized EM (GEM)
algorithm where the M-step is only partially computed, i.e., each M-step only increases the $Q$
function rather than globally maximizing it.

That is, for a GEM algorithm, $\theta^{(t+1)}$ is chosen s.t.

$$Q(\theta^{(t+1)};\theta^{(t)}) \ge Q(\theta^{(t)};\theta^{(t)}).$$

As shown by Dempster, Laird, and Rubin (1977), this condition suffices for not decreasing the
incomplete-data log-likelihood at each iteration, i.e., Proposition 4.1 also holds for each GEM
algorithm. However, appropriate convergence of a GEM algorithm does not follow directly
without further specification of the process of increasing the $Q$ function. For each instance
of a GEM algorithm, one can either show the general convergence conditions for a GEM
algorithm as given by Wu (1983) to hold, or directly prove convergence of the specific GEM
instance in question. The latter approach is pursued in Sect. 4.6.2 where we explicitly show
convergence for a GEM algorithm for log-linear models.
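A partial M-step can be illustrated with a deliberately simple sketch, which is an assumed toy setting and not the GEM algorithm of Sect. 4.6.2: a mixture of two known component distributions with a single unknown mixing weight w. Instead of jumping to the maximizer of Q, the step below moves only part of the way towards it. Since Q is concave in w for this model, any such convex combination does not decrease Q. The component tables, the step size, and the sample are hypothetical.

```python
import math

# Known component distributions f_0(y) and f_1(y) (hypothetical values).
F = [{"a": 0.7, "b": 0.2, "c": 0.1},
     {"a": 0.1, "b": 0.3, "c": 0.6}]

def posterior(w, y):
    """Conditional probability of component 1 given y, with mixing weight w."""
    j0, j1 = (1 - w) * F[0][y], w * F[1][y]
    return j1 / (j0 + j1)

def q_function(w_new, w_old, sample):
    """Q(w_new; w_old): expected complete-data log-likelihood."""
    total = 0.0
    for y in sample:
        k1 = posterior(w_old, y)
        total += (1 - k1) * math.log((1 - w_new) * F[0][y])
        total += k1 * math.log(w_new * F[1][y])
    return total / len(sample)

def log_likelihood(w, sample):
    """Incomplete-data log-likelihood, averaged over the sample."""
    return sum(math.log((1 - w) * F[0][y] + w * F[1][y])
               for y in sample) / len(sample)

def gem_step(w, sample, step=0.3):
    """Partial M-step: move a fraction of the way to the full EM update."""
    w_em = sum(posterior(w, y) for y in sample) / len(sample)  # full M-step
    return w + step * (w_em - w)

sample = list("aabccbcacc")
w = 0.2                                   # assumed starting value
trace = [log_likelihood(w, sample)]
q_gains = []
for _ in range(30):
    w_next = gem_step(w, sample)
    q_gains.append(q_function(w_next, w, sample) - q_function(w, w, sample))
    w = w_next
    trace.append(log_likelihood(w, sample))
```

Both `q_gains` and the differences between successive entries of `trace` can be checked to be non-negative on this example, which is the behaviour the GEM condition is meant to guarantee.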

4.3.3 Partial E-steps and Maximum Pseudo-Likelihood Estimation 

For many cases, a partial computation of the E-step is also useful. These are especially cases 
where the sample space X is too large to be summed over explicitly in the expectations to be 
calculated in the E-steps. The idea here is to replace the intractable probability function with 
respect to which the expectation is taken by a probability function which is more tractable. 
This change in probability functions results in a corresponding change of the likelihood func- 
tion to a pseudo-likelihood function which is now defined with respect to the new tractable 
distribution. Thus, from a general optimization-theoretic point of view, EM with partial E-steps
is an example of maximum pseudo-likelihood estimation.

A theoretical justification for maximum pseudo-likelihood estimation in the context of
EM is given in Neal and Hinton (1998) or Csiszár and Tusnády (1984). In terms of Neal
and Hinton (1998), the EM algorithm can be seen as maximizing a joint function $\mathcal{F}$ of the
parameters and of the distributions over the unobserved data. Using an arbitrary distribution
$q$ over the unobserved variables, $\mathcal{F}$ can be obtained as a lower bound on the incomplete-data
log-likelihood function $L$ as follows.



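$$\begin{aligned}
L(\theta) &= \tilde p\Big[\ln \sum_{x \in X(y)} p_\theta(x)\Big] \\
&= \tilde p\Big[\ln \sum_{x \in X(y)} q(x) \frac{p_\theta(x)}{q(x)}\Big] \\
&\ge \tilde p\Big[\sum_{x \in X(y)} q(x) \ln \frac{p_\theta(x)}{q(x)}\Big] \quad \text{by Jensen's inequality} \\
&= \tilde p[q[\ln p_\theta]] - \tilde p[q[\ln q]] \\
&= \mathcal{F}(q, \theta).
\end{aligned}$$

Provided that values of $x$ are seen as physical states and the energy of a state is $-\ln p_\theta(x)$,
the function $\mathcal{F}(q, \theta)$ can be seen as analogous to the negative of the "free energy" of statistical
physics, i.e., the expected energy under $q$ minus the entropy of $q$. The EM algorithm can be
interpreted in this framework as alternating between maximizing $\mathcal{F}$ as a function of $q$ and $\theta$.
The E-step maximizes $\mathcal{F}$ with respect to $q$ and holds $\theta$ fixed; the M-step maximizes $\mathcal{F}$ with
respect to $\theta$ for fixed $q$.

E-step: Set $q^{(t+1)}$ to $\arg\max_{q} \mathcal{F}(q, \theta^{(t)})$.

M-step: Set $\theta^{(t+1)}$ to $\arg\max_{\theta} \mathcal{F}(q^{(t+1)}, \theta)$.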



Neal and Hinton (1998) show that at a true joint maximization, these iterations are equivalent
to the classical EM iterations defined in Sect. 4.3.1. That is, the maximum in the
E-step is obtained by taking $q^{(t+1)}(x) = k_{\theta^{(t)}}(x|y)$, and at this point we have the equality
$\mathcal{F}(q^{(t+1)}, \theta^{(t)}) = L(\theta^{(t)})$. The maximum in the M-step is obtained by maximizing the term
in $\mathcal{F}$ depending on $\theta$, which is in this case $\tilde p[k_{\theta^{(t)}}[\ln p_\theta]] = Q(\theta;\theta^{(t)})$. Since each such E-step
guarantees that $\mathcal{F} = L$, and since we maximize $Q(\theta;\theta^{(t)})$ in each M-step, we are guaranteed
not to decrease $L$ at each combined EM step.

In a partial E-step, $q^{(t+1)}$ is set to a tractable approximation of $k_{\theta^{(t)}}(x|y)$, which yields
the inequality $\mathcal{F}(q^{(t+1)}, \theta^{(t)}) \le L(\theta^{(t)})$. In the corresponding M-step, the term in $\mathcal{F}$ depending
on $\theta$ is maximized. Together, these combined EM steps guarantee not to decrease the lower
bound $\mathcal{F}$ on the incomplete-data log-likelihood $L$ at each iteration. Thus, for partial E-steps,
monotonicity and convergence of the resulting algorithm have to be shown in terms of the
pseudo-likelihood function $\mathcal{F}$ which bounds the true likelihood function $L$ from below.
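The relation between the lower bound and the log-likelihood can be checked numerically on a small example. The following is a minimal sketch under assumed toy values: the bound is computed as the expected log-probability under q minus the expected log of q, once with the exact conditional distribution and once with a uniform distribution standing in for a tractable approximation. The table `P`, the sample, and all function names are hypothetical.

```python
import math

# Hypothetical complete-data probabilities p_theta(x) for x = (y, z).
P = {("a", 0): 0.10, ("a", 1): 0.30, ("b", 0): 0.45, ("b", 1): 0.15}
SAMPLE = ["a", "b", "b", "a", "b"]

def analyses(y):
    """X(y): all complete data that map to the observation y."""
    return [x for x in P if x[0] == y]

def log_likelihood():
    """Incomplete-data log-likelihood, averaged over the sample."""
    return sum(math.log(sum(P[x] for x in analyses(y)))
               for y in SAMPLE) / len(SAMPLE)

def free_energy(q):
    """Lower bound: expectation of ln(p_theta(x) / q(x)) under q, averaged over y."""
    total = 0.0
    for y in SAMPLE:
        for x in analyses(y):
            if q(x, y) > 0:
                total += q(x, y) * math.log(P[x] / q(x, y))
    return total / len(SAMPLE)

def exact_posterior(x, y):
    """Full E-step: the conditional probability of x given y."""
    return P[x] / sum(P[v] for v in analyses(y))

def uniform_q(x, y):
    """Partial E-step stand-in: a crude approximation of the conditional."""
    return 1.0 / len(analyses(y))

L = log_likelihood()
bound_exact = free_energy(exact_posterior)
bound_partial = free_energy(uniform_q)
```

With the exact conditional distribution the bound coincides with the log-likelihood, whereas the uniform approximation yields a strictly smaller value on this example, in line with the inequality stated for partial E-steps.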



4.4 An EM Example: Baum's Maximization Technique 



4.4.1 Basic Concepts 



A special instance of the EM algorithm for MLE of hidden Markov models, i.e., stochastic regular
grammars, from incomplete data was presented in Baum, Petrie, Soules, and Weiss (1970)
and Baum (1972). The form of this algorithm using dynamic programming techniques for efficient
computation is well-known as the "forward-backward algorithm" (see Rabiner (1989)).
Most popular approaches to parameter estimation for probabilistic grammars are based upon
this technique. Baker (1979) generalized this algorithm to the so-called "inside-outside
algorithm", which efficiently estimates the parameters of stochastic context-free grammars (see
also Booth and Thompson (1973), Lari and Young (1990), and Jelinek, Lafferty, and Mercer
(1990)). This algorithm can also be applied successfully to other stochastic grammars which
assume that their derivation units are independent of each other. Such models are, e.g., stochastic
dependency grammars (Carroll and Charniak 1992) or stochastic lexicalised tree-adjoining
grammars (Resnik 1992; Schabes 1992). In the following, we will refer to the basic version of
this algorithm as Baum's maximization technique.

In the following, we will give a quick review of the basic concepts of Baum's maximiza- 
tion technique. The probabilistic models the algorithm is applied to can be abstracted by 
stochastic derivation models which define a derivation process as a stochastic process as fol- 
lows: Make a stochastic choice at each derivation step and assume the stochastic choices to 
be independent of each other; calculate the probability of a derivation as the joint probability 
of the independent stochastic choices made, and the probability of an input as the sum of the 
probabilities of its derivations. 

More formally, let $\pi = (\pi_{ij}) \in \Pi$ be the parameter vector of the probabilistic processing
model where $\pi_{ij} \ge 0$ and $\sum_j \pi_{ij} = 1$. The variable $i$ ranges over the types of choices that
the stochastic process makes, and the variable $j$ ranges over the alternatives to choose from
when a choice of type $i$ is made. Furthermore, let $y$ denote an input of the probabilistic
processing model, i.e., an observation sequence, and let $x$ denote an output of the model,
i.e., an analysis, and let $Y(x) = y$ be the unique observation corresponding to analysis $x$
and $X(y) = \{x \mid Y(x) = y\}$ be the set of analyses of observation $y$. Finally, let $\nu_{ij}(x)$ be the
number of selections of alternative $j$ for a choice of type $i$ in analysis $x$. The probability of
an analysis is the joint probability of the stochastic choices made in producing it. Since these
stochastic choices are assumed to be independent of each other, the probability of an analysis
is calculated as the product of the probabilities of the stochastic choices made in producing
it:

$$p_\pi(x) = \prod_{ij} \pi_{ij}^{\nu_{ij}(x)}.$$

The probability of an observation is the sum of the probabilities of its analyses:

$$g_\pi(y) = \sum_{x \in X(y)} p_\pi(x).$$

For a given random sample of observations, the purpose of Baum's maximization technique
is to find maximum likelihood parameter values for the incomplete-data likelihood function
$L$ where

$$L(\pi) = \tilde p[\ln g_\pi] = \sum_{y \in \mathcal{Y}} \tilde p(y) \ln \sum_{x \in X(y)} p_\pi(x).$$

The EM mapping $M$ is instantiated here to a particularly simple case. Let $k_\pi(x|y) =
p_\pi(x)/g_\pi(y)$ be the conditional probability of analysis $x$ given observation $y$.
Intuitively, the estimated value of parameter $\pi_{ij}$ is obtained by prorating $N_{ij}$, the expected
number of times choice $ij$ is made during the derivation, by $N_i$, the expected total number
of times a choice of type $i$ is made during the derivation, for all observations $y$. Baum, Petrie,
Soules, and Weiss (1970) showed that this algorithm is hill-climbing, i.e., $L(M(\pi)) \geq L(\pi)$ for
all $\pi \in \Pi$, and that the incomplete-data likelihood $L$ eventually converges to a critical point,
i.e., to a local maximum.
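
The re-estimation step described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the formulation of Baum et al.: the function names (`analysis_prob`, `observation_prob`, `baum_step`, `likelihood`), the dictionary encoding of parameters and count vectors, and the toy corpus are all hypothetical. Each observation is given as a token count together with the list of its analyses, and each analysis is a map from choices `(i, j)` to their counts.

```python
from collections import defaultdict
from math import prod

def analysis_prob(pi, nu):
    """Probability of one analysis: product of the probabilities of its choices."""
    return prod(pi[ij] ** n for ij, n in nu.items())

def observation_prob(pi, analyses):
    """Probability of an observation: sum over the probabilities of its analyses."""
    return sum(analysis_prob(pi, nu) for nu in analyses)

def baum_step(pi, corpus):
    """One re-estimation step: expected count of choice ij prorated by the
    expected count of all choices of type i."""
    expected = defaultdict(float)
    for count, analyses in corpus:
        g = observation_prob(pi, analyses)
        for nu in analyses:
            k = analysis_prob(pi, nu) / g          # conditional probability of the analysis
            for ij, n in nu.items():
                expected[ij] += count * k * n
    totals = defaultdict(float)
    for (i, _), n in expected.items():
        totals[i] += n
    # Keep the old value if a choice type was never used (assumption of this sketch).
    return {(i, j): (expected[(i, j)] / totals[i] if totals[i] > 0 else p)
            for (i, j), p in pi.items()}

def likelihood(pi, corpus):
    return prod(observation_prob(pi, analyses) ** count for count, analyses in corpus)

# Hypothetical toy corpus: one choice type with two alternatives; the second
# observation is ambiguous between the two alternatives.
pi0 = {(1, 1): 0.5, (1, 2): 0.5}
corpus = [(2, [{(1, 1): 1}]), (1, [{(1, 1): 1}, {(1, 2): 1}])]
pi1 = baum_step(pi0, corpus)
```

In this toy case a single step moves the estimate for alternative 1 from 1/2 to 5/6, and the likelihood of the sample does not decrease, in line with the hill-climbing property cited above.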



4.4.2 Baum's Maximization Technique and Context-Dependence in CLP 

The intuitive appeal and the efficient computability of Baum's maximization technique have
led to a multiplicity of applications of this technique to various grammar frameworks. Recently,
an attempt to apply this technique to a probabilistic version of the constraint-based
formalism CUF, which is an instance of the CLP scheme of Höhfeld and Smolka (1988), has
been presented by Eisele (1994). As recognized by Eisele (1994), there is a context-dependence
problem associated with applying this technique to such constraint-based systems. In CLP
terms, the problem is that incompatible variable bindings can lead to failure derivations, which
cause a loss of probability mass in the estimated probability distribution over derivations. A
similar problem appears in every constraint-based system which constrains derivations by
restrictions dependent on the context of the derivation. Approaches embedding Baum's
maximization technique into estimation procedures for context-sensitive constraint-based systems
have been presented, e.g., by Briscoe and Waegner (1992), Briscoe and Carroll (1993), Br
(1995), Miyata (1996), Osborne and Briscoe (1997), or Bod and Kaplan (1998). From an
optimization-theoretic point of view, all such constraint-based approaches contradict the
inherent assumptions of Baum's maximization technique, which require that the derivation steps
are mutually independent and thus that the set of licensed derivations is unconstrained.

This problem of context-dependence is discussed in detail in Abney (1997) in connection
with the so-called Empirical Relative Frequency (ERF) estimation method, which can be seen
as a complete-data version of Baum's estimation technique. He shows that applying this method
to context-sensitive stochastic attribute-value grammars does not generally yield
maximum-likelihood estimates.

In the following, this general argument shall be illustrated with a simple CLP example. Let
us apply the stochastic derivation model introduced above to a simple context-sensitive constraint
logic program (see Fig. 4.1). The stochastic choices of the abstract model correspond to
application probabilities of definite clauses in the generalized SLD-resolution procedure; the
alternatives to choose from when an atom is selected in goal reduction are the different
clauses defining the selected atom. To indicate a probabilistic parameter $\pi_{ij}$, each clause will
be annotated by a number $ij$.

11  s(Z) <- p(Z) & q(Z).
21  p(Z) <- Z = a.
22  p(Z) <- Z = b.
31  q(Z) <- Z = a.
32  q(Z) <- Z = b.



Figure 4.1: Constraint logic program 

The relational atom s(Z) is defined uniquely in clause 11. The atoms p(Z) and q(Z) each 
are defined in two different ways, which for the sake of the example are considered to be 
incompatible. This incompatibility together with the variable sharing in the body of clause 
11 introduces context-dependence into the program. For a selection of atom p(Z) one can 
choose between clauses 21 and 22 in a goal reduction step, whereas for a choice of atom q(Z) 
the alternatives to choose from are clauses 31 and 32. 

Suppose we have a training corpus of three queries, consisting of two tokens of query
$y_1$ : s(Z) & Z = a and one token of query $y_2$ : s(Z) & Z = b. Each query gets a unique proof
tree from the program of Fig. 4.1, i.e., a query of type $y_1$ gets a proof tree of type $x_1$, and
a query of type $y_2$ gets one of type $x_2$ (see Fig. 4.2). Note that in the proof trees of Fig. 4.2
goal reduction and constraint solving are applied in one step.



x1 :                               x2 :

s(Z) & Z = a                       s(Z) & Z = b
   | r, c                             | r, c
11, p(Z) & q(Z) & Z = a            11, p(Z) & q(Z) & Z = b
   | r, c                             | r, c
21, q(Z) & Z = a                   22, q(Z) & Z = b
   | r, c                             | r, c
31, Z = a                          32, Z = b



Figure 4.2: Proof trees from constraint logic program 
For parameter estimation according to Baum's method, we must calculate conditional
probabilities $k_\pi(x|y)$ for $x \in X(y)$. These probabilities will be 1 in each case, since there is a
unique proof tree for each query. Thus for the calculation of $\tilde{p}[N_{ij}] = \tilde{p}[k_\pi[\nu_{ij}]]$, the expected
number of occurrences of clauses in proof trees, we simply have to count and can ignore the
respective probabilities of the proof trees. As in an application of the complete-data ERF
method, for this case Baum's algorithm will give unique parameter estimates
$\bar{\pi}_{ij} = \tilde{p}[N_{ij}] / \tilde{p}[N_i]$ in one step (see Table 4.1).

$ij$                     21     22     31     32
$\tilde{p}[N_{ij}]$      2/3    1/3    2/3    1/3
$\tilde{p}[N_i]$         3/3    3/3    3/3    3/3
$\bar{\pi}_{ij}$ =       2/3    1/3    2/3    1/3

Table 4.1: Estimation using Baum's maximization technique

If we consider the calculation of the probability distribution over the proof trees of such
a probabilistic CLP model, we see that we cannot simply calculate a product for each proof
tree. Instead, we have to introduce a normalization constant in order to ensure that the sum over
the sample space of proof trees is 1. For the program of Fig. 4.1, this partition function
is taken as the sum of the unnormalized probabilities of the proof trees under the estimated
model: $p_\pi(x_1) + p_\pi(x_2) = (1 \cdot 2/3 \cdot 2/3) + (1 \cdot 1/3 \cdot 1/3) = 4/9 + 1/9 = 5/9$. The normalized
probability distribution over proof trees then is: $p'_\pi(x_1) = (4/9)/(5/9) = 4/5$, $p'_\pi(x_2) =
(1/9)/(5/9) = 1/5$. The likelihood $L'$ of our training corpus under the normalized distribution
is: $L' = (4/5)^2 \cdot 1/5 = .128$. However, note that there is no analytical solution to the problem
of finding parameter values $\pi'$ for the clauses of the program of Fig. 4.1 which define $p'$ as
a probabilistic context-free model on the proof trees of Fig. 4.2. Rather, what has happened
here is that we implicitly moved to another family of probability distributions by introducing
the normalization constant into $p'$. This new family of probability distributions obviously no
longer requires the parameter values to sum up to 1 for identical left-hand sides of rules,
but introduces a normalization constant instead in order to guarantee the function to be
a probability function. We will identify this family of probability distributions as log-linear
distributions in the next section. Clearly, we can easily find parameters of a log-linear
model which assigns a higher likelihood to this sample. We could take for example a
parameterization $\pi''$ which assigns $\pi''_{21} = 2$ and $\pi''_{ij} = 1$ for all $ij \neq 21$. This yields a normalized
probability distribution over the proof trees with $p''_{\pi''}(x_1) = 2/3$, $p''_{\pi''}(x_2) = 1/3$ and
likelihood $L'' = (2/3)^2 \cdot 1/3 = .148$. The fact that $L'' > L'$ clearly contradicts the assumption that
the parameter estimates $\pi$ given by applying Baum's estimation technique to a normalized
context-free probability model yield the desired maximum likelihood values.
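
The arithmetic of this counterexample can be checked with a short Python sketch. The encoding is an assumption made for illustration only: clauses are identified by their annotations as strings, each proof tree is the list of clauses it uses, and the helper names `normalized` and `likelihood` are hypothetical.

```python
from math import prod

# Relative-frequency estimates for the clauses of the example program.
pi = {"11": 1.0, "21": 2/3, "22": 1/3, "31": 2/3, "32": 1/3}
# Clauses used in each proof tree, and the number of tokens of each tree in the corpus.
trees = {"x1": ["11", "21", "31"], "x2": ["11", "22", "32"]}
corpus = {"x1": 2, "x2": 1}

def normalized(weights):
    """Product of clause weights per proof tree, renormalized over the licensed trees."""
    unnorm = {x: prod(weights[c] for c in clauses) for x, clauses in trees.items()}
    z = sum(unnorm.values())          # partition function
    return {x: u / z for x, u in unnorm.items()}

def likelihood(p):
    return prod(p[x] ** n for x, n in corpus.items())

p_baum = normalized(pi)               # distribution (4/5, 1/5)

# Alternative weights: 2 for clause 21, 1 for every other clause.
alt = dict.fromkeys(pi, 1.0)
alt["21"] = 2.0
p_alt = normalized(alt)               # distribution (2/3, 1/3)
```

Here `likelihood(p_baum)` evaluates to 0.128 and `likelihood(p_alt)` to 4/27, roughly 0.148, so the weights that do not sum to 1 per predicate give the higher likelihood on this sample.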
4.5 A Log-Linear Probability Model for CLP



As shown in the last section, we cannot simply apply a stochastic context-free derivation 
model to CLP but have to go to more expressive probability models. In fact, we implicitly 
already have made this move in the above example by introducing a partition function into 
the probabilistic context-free model. We will show in the following that acknowledging this 
model as a log-linear model not only opens the possibility to find new consistent maximum 
likelihood estimators but also enables a more flexible parameterization of the probability 
models. 



4.5.1 Motivation 



Log-linear models are widely used in probabilistic modelling but come with different names
in different applications. The name log-linear is standardly used in contingency table
analysis (see, e.g., Knoke and Burke (1980)). The model itself originated under the name of the
Gibbs or Boltzmann distribution in statistical physics as a flexible probability model of
equilibrium states of physical systems. Jaynes (1957) interpreted such equilibrium models in a
more abstract framework and coined the name maximum-entropy model. Log-linear models
have been applied successfully in the area of image processing, where they are known under
the name of random fields (see Geman and Geman (1984)). These special log-linear models
are closely related to other probabilistic network models such as Boltzmann machines
(see Ackley, Hinton, and Sejnowski (1985)) or Bayesian networks (see Frey (1998)).
Log-linear models have also been used in various NLP applications. To name only
a few, these applications include probabilistic grammar models (Mark, Miller, Grenander,
and Abney 1992; Abney 1997), word spellings (Della Pietra, Della Pietra, and Lafferty 1997),
machine translation (Berger, Della Pietra, and Della Pietra 1996), language modelling
(Rosenfeld 1996), prepositional phrase attachment (Ratnaparkhi and Roukos 1994), part-of-speech
tagging (Ratnaparkhi 1996), history-based parsing (Ratnaparkhi 1997), lexical correlations
(Beeferman, Berger, and Lafferty 1997a), text segmentation (Beeferman, Berger, and Lafferty
1997b), and text classification (Nigam, Lafferty, and McCallum 1999).



The popularity of log-linear models is clearly due to the great expressive power they
provide with very simple means. That is, log-linear models can be seen as an exponential
family of probability distributions where the probability of a datum is simply defined as
being proportional to the product of weights assigned to selected properties of the datum.
Let $(\pi_i)$ be a vector of weights and $\nu_i(\omega)$ the number of times property $i$ appears in datum
$\omega$, for all $i = 1, \ldots, n$, then

$$p(\omega) \propto \prod_{i=1}^{n} \pi_i^{\nu_i(\omega)}.$$

A log-linear form is obtained from this simply by replacing proportionality by a constant
$C = Z^{-1}$ and parameters $\pi_i$ by log-parameters $\lambda_i = \ln \pi_i$, for all $i = 1, \ldots, n$, i.e., taking the
logarithm of this probability function yields a linear combination of parameters and properties
and a constant.

$$p(\omega) = C \prod_{i=1}^{n} \pi_i^{\nu_i(\omega)} = Z^{-1} \prod_{i=1}^{n} \pi_i^{\nu_i(\omega)} = Z^{-1} e^{\sum_{i=1}^{n} \lambda_i \nu_i(\omega)}$$

A more general form of log-linear models is obtained by including a fixed initial or reference
distribution $p_0$ into the model such that $p(\omega) = Z^{-1} e^{\sum_{i=1}^{n} \lambda_i \nu_i(\omega)} p_0(\omega)$ and
$Z = \sum_{\omega \in \Omega} e^{\sum_{i=1}^{n} \lambda_i \nu_i(\omega)} p_0(\omega)$.

Clearly, the main advantage of log-linear models is their great flexibility, which includes 
the normalized models used in Sect. 4.4.2 and even probabilistic context-free models as special 
cases (the normalization constant has value 1 in this case). However, considering CLP, with 
log-linear models we are free to select as properties arbitrary features of proof trees rather 
than being restricted to clauses only. For example, we could take subtrees of proof trees as 
properties. This possibility to combine arbitrary clauses to properties allows us to model 
arbitrary context-dependencies in proof trees. Clearly, linguistically there is no particular 
reason for assuming rules or clauses as the best properties to use in a probabilistic grammar. 
As we will see in Sect. 4.7, more abstract properties referring to grammatical functions,
attachment preferences, or other general features of constraint-based parses can be employed
successfully in probabilistic CLGs. Furthermore, the log-parameters corresponding to these
properties are not required to constitute a probability distribution over clauses defining the
same predicate, i.e., the parameters do not have to sum to 1 for clauses defining the same 
predicate. That is, log-linear models allow us to define a probability distribution over proof 
trees directly rather than indirectly as a joint probability of clause applications as in the 
context-free models above. 



Let us illustrate this with the simple CLP example of Sect. 4.4.2. A training corpus
consisting of two tokens of query $y_1$ : s(Z) & Z = a and one token of query $y_2$ : s(Z) & Z = b
together with the corresponding proof trees generated by the program of Fig. 4.1 is depicted
in Fig. 4.3. Note that for ease of readability, we will omit in the following figures the labelings
of nodes and edges of proof trees.
2 x y1 :                      1 x y2 :

s(Z) & Z = a                  s(Z) & Z = b
     |                             |
p(Z) & q(Z) & Z = a           p(Z) & q(Z) & Z = b
     |                             |
q(Z) & Z = a                  q(Z) & Z = b
     |                             |
Z = a                         Z = b



Figure 4.3: Queries and proof trees for constraint logic program 

To capture the statistics of the training sample of Fig. 4.3 it is sufficient to define a single
property which is able to differentiate between the proof tree types. Such a property could be,
for example, the terminal node Z = a of proof tree $x_1$. Setting the value of the corresponding
parameter of this single-parameter model to $\ln 2$ will yield the desired probability distribution
$p(x_1) = 2/3$, $p(x_2) = 1/3$ with incomplete-data likelihood $L = .148$.
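
The single-parameter model can be written down directly as a minimal Python sketch. The function name `loglinear`, the uniform default reference distribution, and the encoding of the property as a count of 1 for x1 and 0 for x2 are assumptions of this sketch.

```python
from math import exp, log

def loglinear(lam, nu, p0=None):
    """Log-linear distribution over the data in `nu`.

    lam: list of log-parameters; nu: dict mapping each datum to its list of
    property counts; p0: optional reference distribution (uniform if omitted).
    """
    if p0 is None:
        p0 = {w: 1.0 / len(nu) for w in nu}
    unnorm = {w: exp(sum(l * n for l, n in zip(lam, counts))) * p0[w]
              for w, counts in nu.items()}
    z = sum(unnorm.values())                      # normalization constant
    return {w: u / z for w, u in unnorm.items()}

# One property: the terminal node "Z = a", present once in x1 and absent in x2.
nu = {"x1": [1], "x2": [0]}
p = loglinear([log(2)], nu)

# Empirical distribution of the two proof tree types in the training corpus.
p_emp = {"x1": 2/3, "x2": 1/3}
model_expectation = sum(p[x] * nu[x][0] for x in nu)
empirical_expectation = sum(p_emp[x] * nu[x][0] for x in nu)
```

With the parameter set to ln 2 the model reproduces the relative frequencies 2/3 and 1/3, and the expected count of the property under the model equals its expected count in the corpus.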

Another way to understand log-linear models is as maximum-entropy models. From this
viewpoint we do statistical inference and, believing that entropy is the unique consistent 
measure of the amount of uncertainty represented by a probability distribution, we obey the 
following principle: 

In making inferences on the basis of partial information we must use that probability
distribution which has maximum entropy subject to whatever is known. This
is the only unbiased assignment we can make; to use any other would amount to
arbitrary assumption of information which by hypothesis we do not have. (Jaynes
1957)



More formally, suppose a random variable $X$ can take on values $x_k, k = 1, \ldots, m$ and we
want to estimate the corresponding probabilities $p_k, k = 1, \ldots, m$. All we have are
expectations of functions $f_i(X), i = 1, \ldots, n$. Let these expectations be defined with respect
to a given empirical distribution $\tilde{p}_k, k = 1, \ldots, m$ on complete data $x_k, k = 1, \ldots, m$ s.t.
$\sum_{k=1}^{m} p_k f_i(x_k) = \sum_{k=1}^{m} \tilde{p}_k f_i(x_k), i = 1, \ldots, n$. Then the maximum-entropy principle can be
stated as follows.

Maximize the entropy $H(p) = -\sum_{k=1}^{m} p_k \ln p_k$ subject to the constraints
$\sum_{k=1}^{m} p_k f_i(x_k) = \sum_{k=1}^{m} \tilde{p}_k f_i(x_k), i = 1, \ldots, n$ and $\sum_{k=1}^{m} p_k = 1$.



For all $p_k, k = 1, \ldots, m$ which solve the above problem, we get the following parametric
solution:

$$p_k = Z^{-1} e^{\sum_{i=1}^{n} \lambda_i f_i(x_k)}$$



Following Jaynes (1957), this result can be derived directly from a constrained optimization
argument where the parameters are viewed as Lagrange multipliers. That is, by applying
the standard technique of Lagrange multipliers (see, e.g., Thomas and Finney (1996)) to
the constrained optimization problem stated in the maximum-entropy principle, the above
parametric probability model can be derived by solving this constrained optimization problem
with respect to the probabilities $p_k$. Let $\Lambda$ denote the Lagrangian defined by

$$\begin{aligned}
\Lambda(p, \lambda) = {} & \sum_{k=1}^{m} p_k \ln p_k - (\lambda_0 + 1) \Big( \sum_{k=1}^{m} p_k - 1 \Big) \\
& - \lambda_1 \Big( \sum_{k=1}^{m} p_k f_1(x_k) - \sum_{k=1}^{m} \tilde{p}_k f_1(x_k) \Big) \\
& \;\vdots \\
& - \lambda_n \Big( \sum_{k=1}^{m} p_k f_n(x_k) - \sum_{k=1}^{m} \tilde{p}_k f_n(x_k) \Big).
\end{aligned}$$

Then the first partial derivative of $\Lambda$ with respect to the $p_k$ is

$$\frac{\partial}{\partial p_k} \Lambda = (\ln p_k + 1) - (\lambda_0 + 1) - \lambda_1 f_1(x_k) - \ldots - \lambda_n f_n(x_k).$$

Now set

$$\frac{\partial}{\partial p_k} \Lambda = 0,$$

then

$$p_k = e^{\lambda_0 + \sum_{i=1}^{n} \lambda_i f_i(x_k)}.$$

Since the sum of all probabilities $p_k$ has to be 1, we have

$$\sum_{k=1}^{m} p_k = \sum_{k=1}^{m} e^{\lambda_0 + \sum_{i=1}^{n} \lambda_i f_i(x_k)} = 1.$$

If we define a partition function $Z$ as

$$Z = \sum_{k=1}^{m} e^{\sum_{i=1}^{n} \lambda_i f_i(x_k)}$$

then

$$\lambda_0 = \ln Z^{-1}$$

and the maximum-entropy distribution is

$$p_k = Z^{-1} e^{\sum_{i=1}^{n} \lambda_i f_i(x_k)}.$$



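To sum up, the parametric form of maximum-entropy probability models can be derived by
solving a constrained optimization problem with respect to the probabilities $p_k, k = 1, \ldots, m$.
The remaining problem, namely solving this constrained maximum-entropy problem with
respect to the parameters $\lambda_i, i = 1, \ldots, n$, can be shown to be equivalent to solving a maximum
likelihood problem for log-linear models. This duality can be stated as follows (see Berger,
Della Pietra, and Della Pietra (1996)). The complete-data log-likelihood $L_c$ of a random
sample from a log-linear model $p_\lambda$ on $X$, with empirical probability $\tilde{p}(x_k)$ of the values
$x_k, k = 1, \ldots, m$ is defined as

$$L_c(\lambda) = \ln \prod_{k=1}^{m} p_\lambda(x_k)^{\tilde{p}(x_k)} = \sum_{k=1}^{m} \tilde{p}(x_k) \ln p_\lambda(x_k).$$

This function is equivalent to the Lagrangian $\Lambda$ instantiated to the parametric model $p_\lambda$:

$$\begin{aligned}
\Lambda(p_\lambda, \lambda) &= \sum_{k=1}^{m} Z_\lambda^{-1} e^{\lambda \cdot f(x_k)} \ln \big( Z_\lambda^{-1} e^{\lambda \cdot f(x_k)} \big) \\
&\quad - \sum_{i=1}^{n} \lambda_i \sum_{k=1}^{m} p_\lambda(x_k) f_i(x_k) + \sum_{i=1}^{n} \lambda_i \sum_{k=1}^{m} \tilde{p}(x_k) f_i(x_k) \\
&= -\ln Z_\lambda + p_\lambda[\lambda \cdot f] - p_\lambda[\lambda \cdot f] + \tilde{p}[\lambda \cdot f] \\
&= -\ln Z_\lambda + \tilde{p}[\lambda \cdot f],
\end{aligned}$$

$$\begin{aligned}
L_c(\lambda) &= \ln \prod_{k=1}^{m} p_\lambda(x_k)^{\tilde{p}(x_k)} \\
&= \sum_{k=1}^{m} \tilde{p}(x_k) \ln \big( Z_\lambda^{-1} e^{\lambda \cdot f(x_k)} \big) \\
&= -\ln Z_\lambda + \tilde{p}[\lambda \cdot f].
\end{aligned}$$

Thus, the values $\lambda^*$ that solve the constrained maximum-entropy problem with respect to
the parameters $\lambda_i, i = 1, \ldots, n$ are equivalently a solution to the complete-data maximum
likelihood problem for the log-linear model $p_\lambda$.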
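
The stated equivalence can be checked numerically on a small example. The following Python sketch assumes a hypothetical three-valued sample space, two hypothetical feature functions, and an arbitrary parameter vector; the function names are illustrative only. It evaluates the Lagrangian at the parametric model (where the normalization term vanishes) and compares it with the complete-data log-likelihood.

```python
from math import exp, log

# Hypothetical feature values f_i(x_k) and empirical distribution.
f = {"a": [1, 0], "b": [0, 1], "c": [1, 1]}
p_emp = {"a": 0.5, "b": 0.3, "c": 0.2}

def model(lam):
    """Parametric model p_lambda(x) = exp(lambda . f(x)) / Z_lambda."""
    unnorm = {x: exp(sum(l * v for l, v in zip(lam, fx))) for x, fx in f.items()}
    z = sum(unnorm.values())
    return {x: u / z for x, u in unnorm.items()}

def log_likelihood(lam):
    """Complete-data log-likelihood: sum_k p_emp(x_k) ln p_lambda(x_k)."""
    p = model(lam)
    return sum(p_emp[x] * log(p[x]) for x in f)

def lagrangian_at_model(lam):
    """Lagrangian evaluated at p = p_lambda; the term for sum_k p_k = 1 is zero."""
    p = model(lam)
    neg_entropy = sum(p[x] * log(p[x]) for x in f)
    constraints = sum(
        l * (sum(p[x] * f[x][i] for x in f) - sum(p_emp[x] * f[x][i] for x in f))
        for i, l in enumerate(lam)
    )
    return neg_entropy - constraints

lam = [0.3, -1.2]
difference = abs(lagrangian_at_model(lam) - log_likelihood(lam))
```

For any choice of `lam` the two values agree up to floating-point error, which is the identity derived above.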

The more general model which includes an initial or reference distribution $p_0$ is derived in
a similar way as the unique parametric probability distribution $p$ that minimizes the
Kullback-Leibler (KL) distance $D(p \| p_0)$ between $p$ and a given reference distribution $p_0$, subject to
certain constraints. That is, the generalized log-linear model

$$p(\omega) = Z^{-1} e^{\sum_{i=1}^{n} \lambda_i f_i(\omega)} p_0(\omega)$$

is the parametric solution to the following constrained optimization problem:

Minimize $D(p \| p_0) = \sum_{\omega \in \Omega} p(\omega) \ln \frac{p(\omega)}{p_0(\omega)}$ subject to the constraints
$p[f_i] = \tilde{p}[f_i], i = 1, \ldots, n$ and $\sum_{\omega \in \Omega} p(\omega) = 1$.

For uniformly distributed $p_0(\omega)$, the KL distance $D(p \| p_0)$ is the negative of the entropy $H(p)$,
minus a constant not involving $\lambda$:

$$D(p \| p_0) = \sum_{\omega \in \Omega} p(\omega) \ln p(\omega) - \ln p_0(\omega) = -H(p) - K.$$

In this case, minimizing the KL distance subject to certain constraints is equivalent to
maximizing the entropy subject to these constraints. Furthermore, a connection to a maximum
likelihood problem can be established for the KL distance minimization problem in a similar
way as for the maximum entropy problem.

4.5.2 The Form of Log-Linear Models 

Log-linear probability distributions define the probability of a datum simply as proportional 
to weights assigned to selected properties of the datum. Formally, the parametric family of 
such distributions is defined as follows. 

Definition 4.1 (Log-linear distribution). A log-linear probability distribution $p_{\lambda \cdot \nu}$ on a
set $\Omega$ is defined s.t. for all $\omega \in \Omega$:

$$p_{\lambda \cdot \nu}(\omega) = Z_{\lambda \cdot \nu}^{-1} e^{\lambda \cdot \nu(\omega)} p_0(\omega),$$

$Z_{\lambda \cdot \nu} = \sum_{\omega \in \Omega} e^{\lambda \cdot \nu(\omega)} p_0(\omega)$ is a normalizing constant,

$\lambda = (\lambda_1, \ldots, \lambda_n)$ is a vector of log-parameters s.t. $\lambda \in \mathbb{R}^n$,

$\chi = (\chi_1, \ldots, \chi_n)$ is a vector of properties,

$\nu = (\nu_1, \ldots, \nu_n)$ is a vector of property-functions s.t. for each $\nu_i : \Omega \to \mathbb{N}$, $\nu_i(\omega)$ is the
number of occurrences of property $\chi_i$ in $\omega$,

$\lambda \cdot \nu(\omega)$ is a weighted property-function s.t. $\lambda \cdot \nu(\omega) = \sum_{i=1}^{n} \lambda_i \nu_i(\omega)$,

$p_0$ is a fixed initial distribution.

For the following discussion, it will be convenient to introduce some further notation.
Properties will be referred to for most purposes by vectors $\nu$ of property functions rather
than by explicit vectors $\chi$ of properties. Slightly abusing terminology, we will call properties
both $\chi$ and $\nu$.

As in Definition 4.1, a log-linear probability distribution depending on property vector $\nu$
and parameter vector $\lambda$ will be written in subscript notation as $p_{\lambda \cdot \nu}$. In case the property
vector is fixed and clear from the context, the model (resp. the normalization constant) will
be written $p_\lambda$ (resp. $Z_\lambda$) to indicate the dependence on the parameter vector $\lambda$.

Furthermore, it will be convenient to have a recursive definition of log-linear models based 
on weighted property-functions which are extended by additional properties and correspond- 
ing parameters. 

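Proposition 4.3. For each weighted property-function $\phi(\omega) = \lambda \cdot \nu(\omega)$, $\psi(\omega) = \gamma \cdot \mu(\omega)$ let
$(\psi + \phi)(\omega) = \psi(\omega) + \phi(\omega)$ be an extended property-function. Then

$$p_{\psi + \phi}(\omega) = Z_{\psi \circ \phi}^{-1} e^{\psi(\omega)} p_\phi(\omega) \quad \text{where } Z_{\psi \circ \phi} = p_\phi[e^{\psi}].$$

Proof.

$$\begin{aligned}
p_{\psi + \phi}(\omega) &= \Big( \sum_{\omega \in \Omega} e^{\psi(\omega) + \phi(\omega)} p_0(\omega) \Big)^{-1} e^{\psi(\omega) + \phi(\omega)} p_0(\omega) \\
&= \Big( \sum_{\omega \in \Omega} e^{\psi(\omega)} Z_\phi^{-1} e^{\phi(\omega)} p_0(\omega) \Big)^{-1} e^{\psi(\omega)} Z_\phi^{-1} e^{\phi(\omega)} p_0(\omega) \\
&= \Big( \sum_{\omega \in \Omega} e^{\psi(\omega)} p_\phi(\omega) \Big)^{-1} e^{\psi(\omega)} p_\phi(\omega) \\
&= Z_{\psi \circ \phi}^{-1} e^{\psi(\omega)} p_\phi(\omega).
\end{aligned}$$

For an extended model with weighted property functions $\phi(\omega) = \lambda \cdot \nu(\omega)$ and $\psi(\omega) = \gamma \cdot \nu(\omega)$,
written $p_{\gamma + \lambda}$, we have accordingly

$$\begin{aligned}
p_{\gamma + \lambda}(\omega) &= Z_{\gamma + \lambda}^{-1} e^{(\gamma + \lambda) \cdot \nu(\omega)} p_0(\omega) \\
&= \Big( \sum_{\omega \in \Omega} e^{(\gamma + \lambda) \cdot \nu(\omega)} p_0(\omega) \Big)^{-1} e^{(\gamma + \lambda) \cdot \nu(\omega)} p_0(\omega) \\
&= \Big( \sum_{\omega \in \Omega} e^{\gamma \cdot \nu(\omega)} p_\lambda(\omega) \Big)^{-1} e^{\gamma \cdot \nu(\omega)} p_\lambda(\omega) \\
&= Z_{\gamma \circ \lambda}^{-1} e^{\gamma \cdot \nu(\omega)} p_{\lambda \cdot \nu}(\omega).
\end{aligned}$$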
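
The recursive characterization can be illustrated with a small numerical check in Python. The sample space, the property counts, the reference distribution, and the two parameter vectors below are hypothetical values chosen for this sketch. The check compares a model built directly from the summed parameters with a model obtained by using the basic model as the reference distribution of the extension.

```python
from math import exp

# Hypothetical property counts and reference distribution on a three-element space.
nu = {"w1": [2, 0], "w2": [1, 1], "w3": [0, 3]}
p0 = {"w1": 0.2, "w2": 0.5, "w3": 0.3}

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def loglinear(params, ref):
    """Log-linear model with the given parameters relative to distribution `ref`."""
    unnorm = {w: exp(dot(params, nu[w])) * ref[w] for w in nu}
    z = sum(unnorm.values())
    return {w: u / z for w, u in unnorm.items()}

lam, gamma = [0.4, -0.7], [-0.1, 0.25]

# Extended model defined directly from the summed parameter vector.
direct = loglinear([l + g for l, g in zip(lam, gamma)], p0)
# Extended model defined recursively: reweight the basic model by exp(gamma . nu).
stepwise = loglinear(gamma, loglinear(lam, p0))

max_gap = max(abs(direct[w] - stepwise[w]) for w in nu)
```

Both constructions yield the same distribution up to rounding, which is why an iterative algorithm can extend the current model step by step instead of recomputing it from the reference distribution.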
4.6 Statistical Inference for Log-Linear Models from Incomplete Data

In the last two sections we argued that a solution to the context-dependence problem in
probabilistic CLP requires probability models which are more expressive than context-free
and proposed log-linear models for this purpose. The price we have to pay for this gain in
expressivity clearly is an increase in the complexity of parameter estimation. Furthermore, the gain
in flexibility due to property selection is an additional complexity factor which calls for an
automatic solution. Fortunately, Della Pietra, Della Pietra, and Lafferty (1997) have presented
a statistical inference algorithm for combined property selection and parameter estimation for
log-linear models. Abney (1997) has shown the applicability of this algorithm to stochastic
attribute-value grammars, which can be seen as a special case of context-sensitive CLGs.

This algorithm, however, applies only to complete data. Unfortunately, the need to rely on 
large training samples of complete data is a problem if such data are difficult to gather. For 
example, in natural language parsing applications, complete data means several person-years 
of hand-annotating large corpora with detailed analyses of specialized grammar frameworks. 
This is always a labor-intensive and error-prone task, which additionally is restricted to the 
specific grammar framework, the specific language, and the specific language domain in ques- 
tion. Clearly, for such applications automatic and reusable techniques for statistical inference 
from incomplete data are desirable. 



In the following, we present a version of the statistical inference algorithm of Della Pietra,
Della Pietra, and Lafferty (1997) especially designed for incomplete data problems. We present
a parameter estimation technique for log-linear models from incomplete data (Sect. 4.6.2)
and a property selection procedure from incomplete data (Sect. 4.6.3). These algorithms are
combined into a statistical inference algorithm for log-linear models from incomplete data
(Sect. 4.6.4). Empirical results on experimenting with these algorithms on a small scale are
presented in Sect. 4.7.

This section is based on work presented in shortened form in Riezler (1998a).



4.6.1 Motivation 

Why is incomplete-data estimation for log-linear models difficult? The answer is because
complete-data estimation for such models is difficult, too. Let us have a look at the first partial
derivatives of some objective functions which are considered in MLE of log-linear models from
complete and incomplete data (see Table 4.2). The system of equations to be solved at the
points where the first partial derivatives of the complete-data log-likelihood function $L_c$ are
zero, i.e., at the critical points of $L_c$, is

$$\sum_{x \in \mathcal{X}} Z_\lambda^{-1} e^{\lambda \cdot \nu(x)} \nu_i(x) = \sum_{x \in \mathcal{X}} \tilde{p}(x) \nu_i(x) \quad \text{for all } i = 1, \ldots, n.$$

                   log-likelihood                                                        auxiliary function
complete data      $\partial L_c / \partial \lambda_i = \tilde{p}[\nu_i] - p_\lambda[\nu_i]$
incomplete data                                                                          $\partial Q / \partial \lambda_i = \tilde{p}[k_{\lambda'}[\nu_i] - p_\lambda[\nu_i]]$

Table 4.2: Partial derivatives of objective functions for MLE of log-linear models

Clearly, because of the dependence of both $Z_\lambda$ and $e^{\lambda \cdot \nu(x)}$ on $\lambda$ this system of equations cannot
be solved coordinate-wise in $\lambda_i$. This problem is even more severe for the case of
incomplete-data estimation. The incomplete-data log-likelihood $L$ has its critical points at the solution
of the following system of equations in $\lambda_i$:

$$\sum_{y \in \mathcal{Y}} \tilde{p}(y) \sum_{x \in \mathcal{X}} Z_\lambda^{-1} e^{\lambda \cdot \nu(x)} \nu_i(x) = \sum_{y \in \mathcal{Y}} \tilde{p}(y) \sum_{x \in X(y)} k_\lambda(x|y) \nu_i(x) \quad \text{for all } i = 1, \ldots, n.$$

Here additionally a dependence of $k_\lambda(x|y)$ on $\lambda$ has to be respected. However, an application
of the standard EM theory to incomplete-data estimation of log-linear models only partially
solves the problem. The equations to be solved to find the critical points of the auxiliary
function $Q(\lambda; \lambda')$ for a log-linear model depending on $\lambda$ are

$$\sum_{y \in \mathcal{Y}} \tilde{p}(y) \sum_{x \in \mathcal{X}} Z_\lambda^{-1} e^{\lambda \cdot \nu(x)} \nu_i(x) = \sum_{y \in \mathcal{Y}} \tilde{p}(y) \sum_{x \in X(y)} k_{\lambda'}(x|y) \nu_i(x) \quad \text{for all } i = 1, \ldots, n.$$

Here $k_{\lambda'}(x|y)$ depends on $\lambda'$ instead of $\lambda$. However, the dependency of $Z_\lambda$ and $e^{\lambda \cdot \nu(x)}$ on $\lambda$
still remains a problem.

Solutions for the system of equations can be found, e.g., by applying general-purpose
numerical optimization methods (see Fletcher (1987)) to the problem in question. For the
smooth and strictly concave complete-data log-likelihood $L_c$, e.g., a conjugate gradient
approach could be used. However, optimization methods specifically tailored to the problem of
MLE from complete data for log-linear models have been presented by Darroch and Ratcliff
(1972) and Della Pietra, Della Pietra, and Lafferty (1997). The "improved iterative scaling"
algorithm of Della Pietra, Della Pietra, and Lafferty (1997) itself is an extension of the
"generalized iterative scaling" algorithm of Darroch and Ratcliff (1972). In generalized iterative
scaling, properties are required to sum up to a constant independent of the complete data, i.e.,
$\nu_\#(x) = \sum_{i=1}^{n} \nu_i(x) = K$ for all $x \in \mathcal{X}$, whereas in improved iterative scaling $\nu_\#$ is allowed to vary
as a function of $x$. This property of "improved iterative scaling" is claimed to improve the
convergence rate by increasing the step size taken toward the maximum at each iteration.
Both iterative scaling algorithms iteratively maximize an auxiliary function $A_c(\gamma; \lambda)$ which is
defined as a lower bound on the difference $L_c(\gamma + \lambda) - L_c(\lambda)$ in complete-data log-likelihood
when going from a basic model $p_\lambda$ to an extended model $p_{\gamma + \lambda}$. The function $A_c(\gamma; \lambda)$ is
maximized as a function of $\gamma$ for fixed $\lambda$, which makes it possible to solve the following equation
coordinate-wise in $\gamma_i$, $i = 1, \ldots, n$:

$$\sum_{x \in \mathcal{X}} p_\lambda(x) \nu_i(x) e^{\gamma_i \nu_\#(x)} = \sum_{x \in \mathcal{X}} \tilde{p}(x) \nu_i(x) \quad \text{for all } i = 1, \ldots, n.$$

A closed form solution for $\gamma_i$ is given for constant $\nu_\#$; otherwise simple numerical methods
such as Newton's method can be used to solve for the $\gamma_i$. It is shown in Della Pietra, Della
Pietra, and Lafferty (1997) and Darroch and Ratcliff (1972) that iteratively replacing $\lambda^{(t)}$
by $\lambda^{(t)} + \gamma^{(t)}$ conservatively increases $L_c$ and such a sequence of likelihood values eventually
converges to the global maximum of the strictly concave function $L_c$.
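
For the constant-sum case the coordinate-wise equation has the closed form $\gamma_i = \frac{1}{K} \ln(\tilde{p}[\nu_i] / p_\lambda[\nu_i])$, which follows by pulling the constant factor $e^{\gamma_i K}$ out of the sum. The following Python sketch runs this update on a hypothetical complete-data problem with three data points, two properties, and $K = 2$; all names and numbers are assumptions of the sketch, and a uniform reference distribution is assumed.

```python
from math import exp, log

# Hypothetical complete-data problem; every datum has property counts summing to K = 2.
nu = {"x1": [2, 0], "x2": [1, 1], "x3": [0, 2]}
p_emp = {"x1": 0.5, "x2": 0.3, "x3": 0.2}
K = 2

def model(lam):
    unnorm = {x: exp(sum(l * n for l, n in zip(lam, nu[x]))) for x in nu}
    z = sum(unnorm.values())
    return {x: u / z for x, u in unnorm.items()}

def expect(p, i):
    """Expectation of property function i under distribution p."""
    return sum(p[x] * nu[x][i] for x in nu)

def log_likelihood(lam):
    p = model(lam)
    return sum(p_emp[x] * log(p[x]) for x in nu)

def scaling_step(lam):
    """Closed-form increment for constant property sum K."""
    p = model(lam)
    gamma = [log(expect(p_emp, i) / expect(p, i)) / K for i in range(len(lam))]
    return [l + g for l, g in zip(lam, gamma)]

lam = [0.0, 0.0]
history = [log_likelihood(lam)]
for _ in range(200):
    lam = scaling_step(lam)
    history.append(log_likelihood(lam))
```

On this toy problem the log-likelihood values in `history` never decrease, and after the iterations the model expectations of both properties match their empirical expectations.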

For the case of incomplete-data estimation things are more complicated. Since the 
incomplete-data log-likelihood function L is not strictly concave, general-purpose numerical 
methods such as conjugate gradient cannot be applied. However, such methods can be ap- 
plied to the auxiliary function Q as defined by a standard EM algorithm for log-linear models. 
Alternatively, iterative scaling methods can be used to perform maximization of the auxiliary 
function Q of the EM algorithm. Both approaches result in a doubly iterative algorithm where 
an iterative algorithm for the M-step is interweaved in the iterative EM algorithm. Clearly, 
this is computationally burdensome and should be avoided. 

The aim of this chapter is exactly to avoid such doubly iterative algorithms. The idea of 
our approach is to interleave the auxiliary functions Q of the EM algorithm and A c of iterative 
scaling in order to define a singly-iterative incomplete-data estimation algorithm using a new 
combined auxiliary function. Similar to the case of iterative scaling for complete data, the 
new auxiliary function will be defined as a lower bound on the improvement in log-likelihood. 
This allows for an intuitive and elegant proof of convergence of the new algorithm. Our proofs
are completely self-contained and do not rely on the convergence of alternating minimization
procedures for maximum-entropy models as presented by Csiszár (1975) or Csiszár (1989)
or on the regularity conditions for generalized EM algorithms as presented by Wu (1983) or
Meng and Rubin (1993). The relation of our algorithm to generalized EM estimation and
maximum-entropy estimation is discussed in Sects. 4.6.2.2 and 4.6.2.3.



4.6.2 Parameter Estimation 
4.6.2.1 General Theory 



Let us start with a problem definition. Applying the incomplete-data framework defined in 
Sect. 4.3.1 to a log-linear probability model for CLP, we can assume the following to be given: 
• observed, incomplete data $y \in \mathcal{Y}$, corresponding to a finite sample of queries for a
constraint logic program $\mathcal{P}$,

• unobserved, complete data $x \in \mathcal{X}$, corresponding to the countably infinite sample of
proof trees for queries $y$ from $\mathcal{P}$,

• a many-to-one function $Y : \mathcal{X} \to \mathcal{Y}$ s.t. $Y(x) = y$ corresponds to the unique query
labeling proof tree $x$, and its inverse $X : \mathcal{Y} \to 2^{\mathcal{X}}$ s.t. $X(y) = \{x \mid Y(x) = y\}$ is the
countably infinite set of proof trees for query $y$ from $\mathcal{P}$,

• a complete-data specification $p_\lambda(x)$, which is a log-linear distribution on $\mathcal{X}$ with given
initial distribution $p_0$, fixed property vector $\chi$ and property-functions vector $\nu$ and
depending on parameter vector $\lambda$,

• an incomplete-data specification $g_\lambda(y)$, which is related to the complete-data
specification by

$$g_\lambda(y) = \sum_{x \in X(y)} p_\lambda(x).$$

The problem of maximum-likelihood estimation for log-linear models from incomplete data
can then be stated as follows.

Given a fixed sample from $\mathcal{Y}$ and a set $\Lambda = \{\lambda \mid p_\lambda(x)$ is a log-linear distribution
on $\mathcal{X}$ with fixed $p_0$, fixed $\nu$ and $\lambda \in \mathbb{R}^n\}$, we want to find a maximum likelihood
estimate $\lambda^*$ of $\lambda$ s.t. $\lambda^* = \arg\max_{\lambda \in \Lambda} L(\lambda)$, where
$L(\lambda) = \ln \prod_{y \in \mathcal{Y}} g_\lambda(y)^{\tilde{p}(y)}$.

For the rest of this section we will refer to a given vector $\nu$ of property functions. Furthermore,
we assume that for each property function $\nu_i$ some proof tree $x \in \mathcal{X}$ with $\nu_i(x) > 0$ exists,
and require $p_0$ to be strictly positive on $\mathcal{X}$, i.e., $p_0(x) > 0$ for all $x \in \mathcal{X}$. These conditions
guarantee that $p_\lambda(x) > 0$ for all $x \in \mathcal{X}$ and for all $\lambda \in \Lambda$, which is a desirable property in the
following discussion.

Similar to the case of iterative scaling for complete-data estimation, we define an auxiliary
function $A(\gamma, \lambda)$ as a conservative estimate of the difference $L(\gamma + \lambda) - L(\lambda)$ in log-likelihood.
The lower bound for the incomplete-data case can be derived from the complete-data case,
in essence, by replacing an expectation of complete, but unobserved data by a conditional
expectation given the observed data and the current fit of the parameter values. Clearly,
this is the same trick that is used in the EM algorithm, but applied in the context of a
different auxiliary function. From the lower-bounding property of the auxiliary function it can
immediately be seen that each maximization step of $A(\gamma, \lambda)$ as a function of $\gamma$ will increase
or hold constant the improvement $L(\gamma + \lambda) - L(\lambda)$. This is a first important property of a






MLE algorithm. Furthermore, our approach to view the incomplete-data auxiliary function 
directly as a lower bound on the improvement in incomplete-data log-likelihood enables an 
intuitive and elegant proof of convergence. 

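Let the conditional probability of complete data $x$ given incomplete data $y$ and parameter
values $\lambda$ be defined as

$$k_\lambda(x|y) = p_\lambda(x) / g_\lambda(y) = \frac{e^{\lambda \cdot \nu(x)} p_0(x)}{\sum_{x' \in X(y)} e^{\lambda \cdot \nu(x')} p_0(x')}.$$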



Then a two-place auxiliary function $A$ can be defined as follows.

Definition 4.2. Let $\lambda \in \Lambda$, $\gamma \in \mathbb{R}^n$, $\nu_\#(x) = \sum_{i=1}^{n} \nu_i(x)$, $\bar{\nu}_i(x) = \nu_i(x) / \nu_\#(x)$. Then

$$A(\gamma, \lambda) = \tilde{p}\Big[ 1 + k_\lambda[\gamma \cdot \nu] - p_\lambda\Big[ \sum_{i=1}^{n} \bar{\nu}_i e^{\gamma_i \nu_\#} \Big] \Big].$$

The particular form of the auxiliary function $A$ and the connection of $A$ and $L$ is discussed
in detail in Lemmata 4.5, 4.6, and 4.7 below. Let us first have a look at the extreme value
properties of $A$, which are crucial for the iterative maximization of $A$.

By considering the first and second derivatives of $A$, we see that $A$ can be maximized
directly and uniquely. This can be explained as follows. The parameter space $\mathbb{R}^n$ of $\gamma$
is a convex set; the Hessian matrix of $A$ is a diagonal matrix filled only with negative
elements

$$\frac{\partial^2 A(\gamma, \lambda)}{\partial\gamma_i \, \partial\gamma_j} = \frac{\partial}{\partial\gamma_j} \frac{\partial A(\gamma, \lambda)}{\partial\gamma_i} \; \begin{cases} < 0 & \text{if } i = j, \\ = 0 & \text{else,} \end{cases}$$

and thus negative definite. Unique maximization follows from this since a function whose
Hessian is negative definite throughout a convex set is strictly concave, a strictly concave
function attains a maximum at at most one point of a convex set, and thus a critical point is
necessarily the maximum (see Horn and Johnson (1985)).



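Proposition 4.4. For each $\lambda \in \Lambda$, $\gamma \in \mathbb{R}^n$: $A(\gamma, \lambda)$ takes its maximum as a function of $\gamma$ at
the unique point $\hat\gamma$ satisfying for each $\hat\gamma_i$, $i = 1, \ldots, n$:

$$\tilde p[k_\lambda[\nu_i]] = \tilde p[p_\lambda[\nu_i e^{\hat\gamma_i \nu_\#}]].$$

Proof.

$$\begin{aligned}
\frac{\partial}{\partial\gamma_i} A(\gamma, \lambda) &= \frac{\partial}{\partial\gamma_i} \tilde p\Big[1 + k_\lambda[\gamma \cdot \nu] - p_\lambda\Big[\sum_{j=1}^n \bar\nu_j e^{\gamma_j \nu_\#}\Big]\Big] \\
&= \tilde p\Big[\frac{\partial}{\partial\gamma_i} \Big(\frac{1}{n} + k_\lambda[\gamma_i \nu_i] - p_\lambda[\bar\nu_i e^{\gamma_i \nu_\#}]\Big)\Big] \\
&= \tilde p[k_\lambda[\nu_i] - p_\lambda[\nu_i e^{\gamma_i \nu_\#}]], \\
\frac{\partial^2}{\partial\gamma_i^2} A(\gamma, \lambda) &= \frac{\partial}{\partial\gamma_i} \tilde p[k_\lambda[\nu_i] - p_\lambda[\nu_i e^{\gamma_i \nu_\#}]] \\
&= -\tilde p[p_\lambda[\nu_i \nu_\# e^{\gamma_i \nu_\#}]] \\
&< 0. \qquad \Box
\end{aligned}$$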

From the auxiliary function $A$ an iterative algorithm for maximizing $L$ is constructed. For
want of a name, we will call this algorithm the "Iterative Maximization (IM)" algorithm. At
each step of the IM algorithm, a log-linear model based on parameter vector $\lambda$ is extended to
a model based on parameter vector $\lambda + \hat\gamma$, where $\hat\gamma$ is an estimate of the parameter vector
that maximizes the improvement in $L$ when moving away from $\lambda$ in the parameter space.
This increment $\hat\gamma$ is estimated by maximizing the auxiliary function $A(\gamma, \lambda)$ as a function of
$\gamma$ and, by Proposition 4.4, is determined for each $i = 1, \ldots, n$ uniquely as the solution $\hat\gamma_i$ to the
equation $\tilde p[k_\lambda[\nu_i]] = \tilde p[p_\lambda[\nu_i e^{\gamma_i \nu_\#}]]$. If $\nu_\#(x) = \sum_{i=1}^n \nu_i(x) = K$ sums to a constant independent
of $x \in X$, there exists a closed form solution for the $\hat\gamma_i$:

$$\hat\gamma_i = \frac{1}{K} \ln \frac{\tilde p[k_\lambda[\nu_i]]}{p_\lambda[\nu_i]} \quad \text{for all } i = 1, \ldots, n.$$

For $\nu_\#$ varying as a function of $x$, Newton's method can be applied to find an approximate
solution (see Sect. 4.8). The IM algorithm in its general form is defined as follows:



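Definition 4.3 (Iterative maximization). Let $\mathcal{M} : \Lambda \to \Lambda$ be a mapping defined by

$$\mathcal{M}(\lambda) = \hat\gamma + \lambda \quad \text{with} \quad \hat\gamma = \arg\max_{\gamma \in \mathbb{R}^n} A(\gamma, \lambda).$$

Then each step of the IM algorithm is defined by

$$\lambda^{(k+1)} = \mathcal{M}(\lambda^{(k)}).$$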
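As an illustration of a single IM step, the following Python sketch solves the fixed-point equation of Proposition 4.4 for one component of the increment by Newton's method. It assumes a finite, explicitly enumerated sample space, non-negative property values, and at least one $x$ with $p_\lambda(x)\,\nu_i(x) > 0$; the function name `solve_increment` and its argument layout are hypothetical and chosen only for this sketch.

```python
import math

def solve_increment(p, nu_i, nu_sharp, target, tol=1e-12, max_iter=100):
    """Solve sum_x p[x] * nu_i[x] * exp(g * nu_sharp[x]) = target for g.

    p        : model probabilities p_lambda(x) over an enumerated finite X (assumption)
    nu_i     : values nu_i(x) >= 0 of one property function
    nu_sharp : values nu_#(x) = sum_i nu_i(x)
    target   : the observed-data side p~[k_lambda[nu_i]], assumed > 0
    """
    g = 0.0
    for _ in range(max_iter):
        terms = [px * ni * math.exp(g * ns) for px, ni, ns in zip(p, nu_i, nu_sharp)]
        f = sum(terms) - target                          # residual of the equation
        df = sum(t * ns for t, ns in zip(terms, nu_sharp))  # derivative w.r.t. g
        step = f / df
        g -= step                                        # Newton update
        if abs(step) < tol:
            break
    return g
```

When `nu_sharp` is a constant $K$, the value returned should agree with the closed form $\frac{1}{K} \ln(\tilde p[k_\lambda[\nu_i]] / p_\lambda[\nu_i])$; for varying $\nu_\#$ the residual of the equation can be checked directly.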
In order to show the monotonicity and convergence properties of the IM algorithm, we
first must prove some provisional results. Lemma 4.5 shows that the auxiliary function $A(\gamma, \lambda)$
is a lower bound on the incomplete-data log-likelihood difference $L(\gamma + \lambda) - L(\lambda)$. In the first
inequality we apply Jensen's inequality to the natural logarithm of an expectation. We get a
simplified form similar to the log-likelihood difference for complete data, modulo an empirical
distribution over complete data being replaced by the conditional distribution $k_\lambda(x|y)$. This
form is simplified further by omitting the logarithm, using the inequality $\ln x \leq x - 1$. Furthermore,
the random variable $\nu_\#$ on $X$ is introduced in order to define the weights $\bar\nu_i(x)$, which
for each $x$ form a probability distribution over $i = 1, \ldots, n$. Applying Jensen's inequality to an
expectation with respect to $\bar\nu_i$ in the exponent of $e$,
we arrive at a final simplified form, defining the auxiliary function $A$.

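Lemma 4.5. $A(\gamma, \lambda) \leq L(\gamma + \lambda) - L(\lambda)$.

Proof.

$$\begin{aligned}
L(\gamma + \lambda) - L(\lambda) &= \sum_{y \in Y} \tilde p(y) \ln g_{\gamma+\lambda}(y) - \sum_{y \in Y} \tilde p(y) \ln g_\lambda(y) \\
&= \tilde p\Big[\ln \frac{g_{\gamma+\lambda}(\cdot)}{g_\lambda(\cdot)}\Big] \\
&= \tilde p\Big[\ln \sum_{x \in X(\cdot)} \frac{p_\lambda(x)}{g_\lambda(\cdot)} \, \frac{p_{\gamma+\lambda}(x)}{p_\lambda(x)}\Big] \\
&\geq \tilde p\Big[\sum_{x \in X(\cdot)} \frac{p_\lambda(x)}{g_\lambda(\cdot)} \ln \frac{p_{\gamma+\lambda}(x)}{p_\lambda(x)}\Big] && \text{by Jensen's inequality} \\
&= \tilde p\Big[\sum_{x \in X(\cdot)} k_\lambda(x|\cdot) \big(\ln p_{\gamma+\lambda}(x) - \ln p_\lambda(x)\big)\Big] \\
&= \tilde p\Big[\sum_{x \in X(\cdot)} k_\lambda(x|\cdot) \big(-\ln p_\lambda[e^{\gamma \cdot \nu}] + \ln e^{\gamma \cdot \nu(x)} + \ln p_\lambda(x) - \ln p_\lambda(x)\big)\Big] \\
&= \tilde p\big[k_\lambda[\gamma \cdot \nu] - \ln p_\lambda[e^{\gamma \cdot \nu}]\big] \\
&\geq \tilde p\big[k_\lambda[\gamma \cdot \nu] + 1 - p_\lambda[e^{\gamma \cdot \nu}]\big] && \text{since } \ln x \leq x - 1 \\
&= \tilde p\Big[k_\lambda[\gamma \cdot \nu] + 1 - \sum_{x \in X} p_\lambda(x) \, e^{\sum_{i=1}^n \gamma_i \bar\nu_i(x) \nu_\#(x)}\Big] \\
&\geq \tilde p\Big[k_\lambda[\gamma \cdot \nu] + 1 - \sum_{x \in X} p_\lambda(x) \sum_{i=1}^n \bar\nu_i(x) e^{\gamma_i \nu_\#(x)}\Big] && \text{by Jensen's inequality} \\
&= \tilde p\Big[k_\lambda[\gamma \cdot \nu] + 1 - p_\lambda\Big[\sum_{i=1}^n \bar\nu_i e^{\gamma_i \nu_\#}\Big]\Big] \\
&= A(\gamma, \lambda). \qquad \Box
\end{aligned}$$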
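The lower-bound property can also be checked numerically. The sketch below builds a small hypothetical model (four complete data points, two non-negative properties, two observations, and a uniform reference distribution, all of which are assumptions made only for this illustration) and evaluates both $A(\gamma, \lambda)$ and the log-likelihood difference, so that the inequality of Lemma 4.5 can be compared for arbitrary $\gamma$ and $\lambda$.

```python
import math
import random

# Hypothetical toy model, not taken from the text.
NU = [(1, 0), (0, 2), (1, 1), (2, 1)]      # property values nu(x) for x = 0..3
X_OF_Y = {"y1": [0, 1], "y2": [2, 3]}      # X(y): complete data compatible with y
P_EMP = {"y1": 0.4, "y2": 0.6}             # empirical distribution over y

def model(lam):
    """Log-linear p_lambda over X with a uniform reference distribution (assumption)."""
    w = [math.exp(sum(l * v for l, v in zip(lam, nu))) for nu in NU]
    z = sum(w)
    return [wi / z for wi in w]

def loglik(lam):
    """Incomplete-data log-likelihood L(lambda) = sum_y p~(y) ln g_lambda(y)."""
    p = model(lam)
    return sum(P_EMP[y] * math.log(sum(p[x] for x in xs)) for y, xs in X_OF_Y.items())

def aux(gamma, lam):
    """Auxiliary function A(gamma, lambda) of Definition 4.2."""
    p = model(lam)
    cond = 0.0                              # p~[k_lambda[gamma . nu]]
    for y, xs in X_OF_Y.items():
        g_y = sum(p[x] for x in xs)
        cond += P_EMP[y] * sum(
            p[x] / g_y * sum(g * v for g, v in zip(gamma, NU[x])) for x in xs)
    pen = 0.0                               # p_lambda[sum_i nubar_i exp(gamma_i nu_#)]
    for x, nu in enumerate(NU):
        s = sum(nu)                         # nu_#(x), positive for every x here
        pen += p[x] * sum(v / s * math.exp(g * s) for g, v in zip(gamma, nu))
    return 1.0 + cond - pen
```

Evaluating `aux(gamma, lam)` against `loglik([l + g for l, g in zip(lam, gamma)]) - loglik(lam)` for random vectors gives a quick sanity check of the bound, and `aux([0.0, 0.0], lam)` should vanish.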



Lemma 4.6 shows that there is no estimated improvement in log-likelihood at the origin.

Lemma 4.6. $A(0, \lambda) = 0$.

Proof.

$$A(0, \lambda) = \tilde p\Big[k_\lambda[0 \cdot \nu] + 1 - \sum_{x \in X} p_\lambda(x) \sum_{i=1}^n \bar\nu_i(x) e^0\Big] = 0. \qquad \Box$$



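Lemma 4.7 shows that the critical points of $A$ and $L$ as functions of $\gamma$ for fixed $\lambda$ are the
same.

Lemma 4.7. $\frac{d}{dt}\big|_{t=0} A(t\gamma, \lambda) = \frac{d}{dt}\big|_{t=0} L(t\gamma + \lambda)$.

Proof.

$$\begin{aligned}
\frac{d}{dt} A(t\gamma, \lambda) &= \frac{d}{dt} \, \tilde p\Big[k_\lambda[t\gamma \cdot \nu] + 1 - \sum_{x \in X} p_\lambda(x) \sum_{i=1}^n \bar\nu_i(x) e^{t\gamma_i \nu_\#(x)}\Big] \\
&= \tilde p\Big[k_\lambda[\gamma \cdot \nu] - \sum_{x \in X} p_\lambda(x) \sum_{i=1}^n \bar\nu_i(x) e^{t\gamma_i \nu_\#(x)} \gamma_i \nu_\#(x)\Big] \\
&= \tilde p\Big[k_\lambda[\gamma \cdot \nu] - \sum_{x \in X} p_\lambda(x) \sum_{i=1}^n \nu_i(x) \gamma_i e^{t\gamma_i \nu_\#(x)}\Big], \\
\frac{d}{dt}\Big|_{t=0} A(t\gamma, \lambda) &= \tilde p\Big[k_\lambda[\gamma \cdot \nu] - \sum_{x \in X} p_\lambda(x) \sum_{i=1}^n \nu_i(x) \gamma_i e^0\Big] \\
&= \tilde p\big[k_\lambda[\gamma \cdot \nu] - p_\lambda[\gamma \cdot \nu]\big].
\end{aligned}$$

$$\begin{aligned}
\frac{d}{dt} L(t\gamma + \lambda) &= \tilde p\Big[\Big(\sum_{x \in X(\cdot)} p_{t\gamma+\lambda}(x)\Big)^{-1} \frac{d}{dt} \sum_{x \in X(\cdot)} p_{t\gamma+\lambda}(x)\Big] \\
&= \tilde p\Big[\Big(\sum_{x \in X(\cdot)} p_{t\gamma+\lambda}(x)\Big)^{-1} \sum_{x \in X(\cdot)} p_{t\gamma+\lambda}(x) \big(\gamma \cdot \nu(x) - p_{t\gamma+\lambda}[\gamma \cdot \nu]\big)\Big] \\
&= \tilde p\Big[-\sum_{x \in X(\cdot)} p_{t\gamma+\lambda}(x) \, p_{t\gamma+\lambda}[\gamma \cdot \nu] \Big(\sum_{x \in X(\cdot)} p_{t\gamma+\lambda}(x)\Big)^{-1} \\
&\qquad\; + \sum_{x \in X(\cdot)} p_{t\gamma+\lambda}(x) \, \gamma \cdot \nu(x) \Big(\sum_{x \in X(\cdot)} p_{t\gamma+\lambda}(x)\Big)^{-1}\Big] \\
&= \tilde p\big[-p_{t\gamma+\lambda}[\gamma \cdot \nu] + k_{t\gamma+\lambda}[\gamma \cdot \nu]\big], \\
\frac{d}{dt}\Big|_{t=0} L(t\gamma + \lambda) &= \tilde p\big[k_\lambda[\gamma \cdot \nu] - p_\lambda[\gamma \cdot \nu]\big]. \qquad \Box
\end{aligned}$$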



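One central result of this section is stated in Theorem 4.8. It shows the monotonicity of
the IM algorithm, i.e., the incomplete-data log-likelihood $L$ is increasing on each iteration of
the IM algorithm except at fixed points of $\mathcal{M}$ or equivalently at critical points of $L$.

Theorem 4.8 (Monotonicity). For all $\lambda \in \Lambda$: $L(\mathcal{M}(\lambda)) \geq L(\lambda)$ with equality iff $\lambda$ is a
fixed point of $\mathcal{M}$ or equivalently is a critical point of $L$.

Proof.

$$\begin{aligned}
L(\mathcal{M}(\lambda)) - L(\lambda) &\geq A(\hat\gamma, \lambda) && \text{by Lemma 4.5} \\
&\geq 0 && \text{by Lemma 4.6 and definition of } \mathcal{M}.
\end{aligned}$$

The equality $L(\mathcal{M}(\lambda)) = L(\lambda)$ holds iff $\lambda$ is a fixed point of $\mathcal{M}$, i.e., $\mathcal{M}(\lambda) = \hat\gamma + \lambda$ with
$\hat\gamma = 0$. Furthermore, $\lambda$ is a fixed point of $\mathcal{M}$ iff $\hat\gamma = \arg\max_{\gamma \in \mathbb{R}^n} A(\gamma, \lambda) = 0$,

$\iff$ for all $\gamma \in \mathbb{R}^n$: $\hat t = \arg\max_{t \in \mathbb{R}} A(t\gamma, \lambda) = 0$,

$\iff$ for all $\gamma \in \mathbb{R}^n$: $\frac{d}{dt}\big|_{t=0} A(t\gamma, \lambda) = 0$,

$\iff$ for all $\gamma \in \mathbb{R}^n$: $\frac{d}{dt}\big|_{t=0} L(t\gamma + \lambda) = 0$, by Lemma 4.7,

$\iff$ $\lambda$ is a critical point of $L$. $\Box$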



Corollary 4.9 implies that a maximum likelihood estimate is a fixed point of the mapping $\mathcal{M}$.

Corollary 4.9. Let $\lambda^* = \arg\max_{\lambda \in \Lambda} L(\lambda)$. Then $\lambda^*$ is a fixed point of $\mathcal{M}$.



Theorem 4.10 discusses the convergence properties of the IM algorithm. In contrast to
the improved iterative scaling algorithm, we cannot show convergence to a global maximum
of a strictly concave objective function. Rather we can show convergence of a sequence of IM
iterates to a critical point of the non-concave incomplete-data log-likelihood function $L$. The
central property to show is that all limit points of a sequence of IM iterates are critical points
of $L$.

Theorem 4.10 (Convergence). Let $\{\lambda^{(k)}\}$ be a sequence in $\Lambda$ determined by the IM algorithm.
Then all limit points of $\{\lambda^{(k)}\}$ are fixed points of $\mathcal{M}$ or equivalently are critical points
of $L$.

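Proof. Let $\{\lambda^{(k_n)}\}$ be a subsequence of $\{\lambda^{(k)}\}$ converging to $\bar\lambda$. Then for all $\gamma \in \mathbb{R}^n$:

$$\begin{aligned}
A(\gamma, \lambda^{(k_n)}) &\leq A(\hat\gamma^{(k_n)}, \lambda^{(k_n)}) && \text{by definition of } \mathcal{M} \\
&\leq L(\hat\gamma^{(k_n)} + \lambda^{(k_n)}) - L(\lambda^{(k_n)}) && \text{by Lemma 4.5} \\
&= L(\lambda^{(k_n + 1)}) - L(\lambda^{(k_n)}) && \text{by definition of IM} \\
&\leq L(\lambda^{(k_{n+1})}) - L(\lambda^{(k_n)}) && \text{by monotonicity of } L(\lambda^{(k)}),
\end{aligned}$$

and in the limit as $n \to \infty$, for continuous $A$ and $L$: $A(\gamma, \bar\lambda) \leq L(\bar\lambda) - L(\bar\lambda) = 0$. Thus
$\gamma = 0$ is a maximum of $A(\gamma, \bar\lambda)$, using Lemma 4.6, and $\bar\lambda$ is a fixed point of $\mathcal{M}$. Furthermore,
$\frac{d}{dt}\big|_{t=0} A(t\gamma, \bar\lambda) = \frac{d}{dt}\big|_{t=0} L(t\gamma + \bar\lambda) = 0$, using Lemma 4.7, and $\bar\lambda$ is a critical point of $L$. $\Box$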



From this and Theorem 4.8 it follows immediately that each sequence of likelihood values for
which an upper bound exists converges monotonically to the likelihood value of a critical point of $L$.

Corollary 4.11. Let $\{L(\lambda^{(k)})\}$ be a sequence of likelihood values bounded from above. Then
$\{L(\lambda^{(k)})\}$ converges monotonically to a value $L^* = L(\lambda^*)$ for some critical point $\lambda^*$ of $L$.

Thus, the general properties of the IM algorithm are as follows: The IM algorithm con- 
servatively increases the incomplete-data log-likelihood function L. Furthermore, it converges 
monotonically to a critical point of L, which in almost all cases is a local maximum. And it 
shows a chaotic behaviour in that for functions L with several extreme values, convergence 
will be extremely sensitive to the starting value of a sequence of iterates. 

4.6.2.2 Relation to Generalized EM Estimation 



As discussed in Sect. 4.6.1, a direct application of the standard EM theory to log-linear
models is complicated, since complete-data MLE is complicated for log-linear models. That
is, a direct application of the EM algorithm to log-linear models always is doubly iterative,
because the M-step itself involves some kind of iterative scaling procedure. Examples using
iterative M-steps in MLE of log-linear models for partially classified contingency tables are
given in Little and Rubin (1987).



Iterative M-steps can be avoided by going to partial M-steps, i.e., to GEM algorithms,
as shown in Sect. 4.3.2. In a GEM algorithm, the auxiliary function $Q$ is increased in each
M-step rather than maximized. That means, if the improved iterative scaling algorithm is
used in the M-step, a single maximization step on the auxiliary function of this algorithm
suffices to increase the objective function of this algorithm. Della Pietra, Della Pietra, and
Lafferty (1997) use the auxiliary function $A_c(\gamma, \lambda) = 1 + \tilde p[\gamma \cdot \nu] - p_\lambda[\sum_{i=1}^n \bar\nu_i e^{\gamma_i \nu_\#}]$ for the
objective complete-data log-likelihood function $L_c(\lambda) = \ln \prod_{x \in X} p_\lambda(x)^{\tilde p(x)}$. An incorporation
of this complete-data MLE algorithm into a GEM setting yields the following procedure: First,
for a given sample from $Y$, the auxiliary function $Q$ for the incomplete-data log-likelihood
$L = \ln \prod_{y \in Y} g_\lambda(y)^{\tilde p(y)}$ is computed as prescribed by the E-step of the EM theory. Next, $\lambda^{(t+1)}$
is set to increase $Q$. That is, we perform only a partial M-step. This task can be fulfilled
by tuning the complete-data auxiliary function $A_c$ of Della Pietra, Della Pietra, and Lafferty
(1997) to a new auxiliary function $A$ for the manufactured objective function $Q$, and by
performing a one-step maximization of the complete-data auxiliary function $A_c$.

E-step: Compute $Q(\lambda; \lambda^{(t)}) = \tilde p[k_{\lambda^{(t)}}[\ln p_\lambda]]$ for a log-linear model $p_\lambda$.

M-step: Choose $\lambda^{(t+1)}$ s.t. $Q(\lambda^{(t+1)}; \lambda^{(t)}) \geq Q(\lambda^{(t)}; \lambda^{(t)})$,
i.e., $\lambda^{(t+1)} = \hat\gamma^{(t)} + \lambda^{(t)}$ with $\hat\gamma^{(t)} = \arg\max_{\gamma \in \mathbb{R}^n} A(\gamma, \lambda^{(t)})$,
and $A(\gamma, \lambda^{(t)}) = \tilde p[1 + k_{\lambda^{(t)}}[\gamma \cdot \nu] - p_{\lambda^{(t)}}[\sum_{i=1}^n \bar\nu_i e^{\gamma_i \nu_\#}]]$.

Note that the auxiliary function $A$ which is constructed by applying the complete-data auxiliary
function $A_c$ to the manufactured complete-data log-likelihood $Q$ is identical to our
auxiliary function $A$ as specified in Definition 4.2. From the theory of the improved iterative
scaling algorithm we can deduce that $Q$ is increased at each M-step of the above procedure.
Given this, the theory of the GEM algorithm tells us that the incomplete-data log-likelihood $L$
also is increased at each GEM step of the above procedure. However, convergence of this combined
procedure has yet to be studied. An intuitive and elegant way to do this is by considering
the auxiliary function $A$ as a lower bound not only on the manufactured complete-data
log-likelihood $Q$ but also directly on the incomplete-data log-likelihood $L$, and to prove convergence
directly from the relation of $A$ to $L$. This is the approach we took in the last section.



4.6.2.3 Relation to Maximum-Entropy Estimation 



The improved iterative scaling algorithm can be seen also from the perspective of maximum-entropy
estimation. Della Pietra, Della Pietra, and Lafferty (1997) and Berger, Della Pietra,
and Della Pietra (1996) show a duality between maximum likelihood and maximum entropy
problems, which can be stated as follows.

The probability distribution $p^*$ with maximum entropy subject to constraints
$p[\nu_i] = \tilde p[\nu_i]$, $i = 1, \ldots, n$ from a distribution $\tilde p(x)$ over complete data $X$ is the
model in the parametric family of log-linear models $p_\lambda$ that maximizes the likelihood
of the training sample $X$ distributed according to $\tilde p(x)$.

Clearly, due to the lack of a distribution $\tilde p(x)$ over complete data $X$, a similar result
cannot hold for the incomplete-data case. Rather, in each M-step we get a maximum of
a manufactured complete-data likelihood $Q(\lambda; \lambda') = \tilde p[k_{\lambda'}[\ln p_\lambda]]$ which corresponds to a
maximum-entropy solution subject to constraints from the conditional distribution $k_{\lambda'}(x|y)$.
If the M-steps are partial themselves, i.e., if we use a GEM setting, then we get the following
"increasing-entropy" theorem:

The probability distribution $p^*$ that increases the entropy $H(p)$ for any probability
distribution $p$ subject to the constraints $p[\nu_i] = k_{\lambda'}[\nu_i]$, $i = 1, \ldots, n$ from a
conditional distribution $k_{\lambda'}(x|y)$ is the model in the parametric family of log-linear
probability distributions $p_\lambda$ with $Q(\lambda; \lambda') \geq Q(\lambda'; \lambda')$.



4.6.3 Property Selection 

For the task of parameter estimation discussed in the last section, we assumed a vector of 
properties to be given. Clearly, exhaustive sets of properties can grow unmanageably large 
and must be curtailed. An appropriate quality measure on properties can then be used to 
define an algorithm for automatic property selection. 

More generally, property selection can be seen from the viewpoint of model induction. 
That means, selecting prominent properties out of a set of possible properties can be seen 
as incrementally inducing a model that captures only the salient statistical qualities of the 
training data. Such induced models disallow overfitting the training data, which would be the 
case with models with one unique property per training element. Instead, compact models 
allow generalizations to new data and temper the overtraining problem. 



Different approaches to model induction have been presented. For example, Stolcke and
Omohundro (1994) have given a Bayesian approach to inducing the structure of hidden
Markov models. This approach starts with a hidden Markov model that directly encodes the
data, and proceeds by incrementally generalizing by merging states according to a Bayesian
posterior probability measure. This measure trades off the likelihood of the data, which prefers
overfitting models, against a prior probability, which prefers simpler models. Maximization of
the posterior probability, i.e., the product of the prior and the likelihood, determines which
states to merge and when to stop generalizing.



The property selection approach presented by Della Pietra, Della Pietra, and Lafferty
(1997) and Berger, Della Pietra, and Della Pietra (1996) proceeds from the opposite direction.
Starting from a uniform distribution over the data, which is encoded by a model with no
properties at all, properties are incrementally added to the model according to a likelihood
measure. A naive form of this measure is the improvement in complete-data log-likelihood
when extending a model by a single candidate property $c$ with corresponding log-parameter
$\alpha$. Unfortunately, when a new parameter is added to the parameter vector of the model,
the optimal values can change for all parameters. Thus the calculation of the likelihood
improvement due to adding a single property requires MLE for all parameters. Clearly, this
is infeasible for models with large parameter spaces. Della Pietra, Della Pietra, and Lafferty
(1997) and Berger, Della Pietra, and Della Pietra (1996) propose an approximate solution
where the complete-data log-likelihood function is maximized directly as a function of a single
parameter $\alpha$. That is, the improvement due to adding a single candidate is approximated by
adjusting only the parameter of this candidate and holding all other parameters fixed. This
yields a greedy algorithm which makes it practical to evaluate a large number of candidates
at each stage of the combined inference algorithm.

Let us turn now to property selection for log-linear CLP models. For the sake of concrete- 
ness, let properties of proof trees be specified as connected, non-overlapping subtrees of proof 
trees as follows: A property of a proof tree is a connected subgraph of a proof tree, where 
each node of such a subtree has either zero descendants or the same number of descendants 
as the corresponding node of the supertree, and the node sets of every two subtrees in the set 
of properties must not intersect. 

Suppose furthermore that properties can be incrementally constructed by selecting from 
an initial set of goals and from subtrees built by performing a resolution step at a terminal 
node of a subtree already in the model. 

Clearly, an exhaustive set of such properties must be pruned according to some quality 
measure. What could be an appropriate quality measure for the case of incomplete data? 



For an MLE framework, the approach of Della Pietra, Della Pietra, and Lafferty (1997) and
Berger, Della Pietra, and Della Pietra (1996) offers itself. Unfortunately, we cannot apply
the approximate solution of maximizing the likelihood as a function of a single parameter $\alpha$,
since the incomplete-data log-likelihood $L$ is not concave in the parameters. However, we can
express a conservative estimate of the likelihood gain by instantiating the auxiliary function
$A$ of Definition 4.2 to the extension of a model $p_{\lambda \cdot \nu}$ by a single property $c$ with parameter $\alpha$:

$$\begin{aligned}
A(\alpha, \lambda) &= \tilde p\Big[1 + k_\lambda[\alpha c] - p_\lambda\Big[\sum_{i=1}^n \bar c_i e^{\alpha_i c_\#}\Big]\Big] \\
&= \tilde p\big[1 + k_\lambda[\alpha c] - p_\lambda[e^{\alpha c}]\big]
\end{aligned}$$

since $\alpha_i = \alpha$, $c_i(x) = c(x)$, $c_\#(x) = c(x)$, $\bar c_i(x) = 1$.

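From this, we can define an estimated likelihood gain $G_c(\alpha, \lambda)$ for a candidate $c$ as follows.

Definition 4.4. Let $\lambda \cdot \nu(x)$ be a weighted property function, $c$ be a candidate property, and
$\alpha \in \mathbb{R}$ the log-parameter corresponding to $c$. Then the estimated gain $G_c(\alpha, \lambda)$ of adding
candidate property $c$ with parameter value $\alpha$ to the log-linear model $p_{\lambda \cdot \nu}$ is defined s.t.

$$G_c(\alpha, \lambda) = \tilde p\big[1 + k_{\lambda \cdot \nu}[\alpha c] - p_{\lambda \cdot \nu}[e^{\alpha c}]\big].$$

Clearly, this estimated likelihood gain $G_c(\alpha, \lambda)$ is a lower bound on the true likelihood
gain $L(\alpha + \lambda) - L(\lambda)$ for a parameter $\alpha$ corresponding to a property $c$. $G_c(\alpha, \lambda)$ also is strictly
concave in the parameters and can be maximized directly and uniquely.

Proposition 4.12. $G_c(\alpha, \lambda)$ takes its maximum as a function of $\alpha$ at the unique point $\hat\alpha$
satisfying

$$\tilde p[k_{\lambda \cdot \nu}[c]] = \tilde p[p_{\lambda \cdot \nu}[c \, e^{\hat\alpha c}]].$$

Proof.

$$\begin{aligned}
\frac{\partial}{\partial\alpha} G_c(\alpha, \lambda) &= \tilde p\big[k_{\lambda \cdot \nu}[c] - p_{\lambda \cdot \nu}[c \, e^{\alpha c}]\big], \\
\frac{\partial^2}{\partial\alpha^2} G_c(\alpha, \lambda) &= -\tilde p\big[p_{\lambda \cdot \nu}[c^2 e^{\alpha c}]\big] < 0. \qquad \Box
\end{aligned}$$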
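As a worked special case (an assumption made for illustration, not part of the original derivation), suppose the candidate property is binary, $c(x) \in \{0, 1\}$, and write $a = \tilde p[k_{\lambda \cdot \nu}[c]]$ and $b = p_{\lambda \cdot \nu}[c]$ with $a, b > 0$. Then $e^{\alpha c} = 1 + c\,(e^{\alpha} - 1)$, so that $p_{\lambda \cdot \nu}[e^{\alpha c}] = 1 + b\,(e^{\alpha} - 1)$, and both the maximizer and the maximal estimated gain are available in closed form:

$$\begin{aligned}
G_c(\alpha, \lambda) &= \alpha a - b\,(e^{\alpha} - 1), \\
\frac{\partial}{\partial\alpha} G_c(\alpha, \lambda) = a - b\,e^{\alpha} = 0 \;&\Longrightarrow\; \hat\alpha = \ln \frac{a}{b}, \\
G_c(\hat\alpha, \lambda) &= a \ln \frac{a}{b} - a + b \;\geq\; 0.
\end{aligned}$$

For instance, $a = 0.75$ and $b = 0.5$ give $\hat\alpha = \ln 1.5$ and an estimated gain of about $0.054$. The non-negativity in the last line follows from $\ln t \geq 1 - 1/t$ with $t = a/b$.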

Property selection then will incorporate that property out of the set of candidates that 
gives the greatest improvement to the model at the property's best adjusted parameter value. 
Since we are interested only in relative, not absolute gains, a single, non-iterative maximization 
of the estimated gain will suffice to choose from the candidates. This yields a greedy algorithm 
for approximate property selection defined as follows. 

Definition 4.5 (Property selection). Let $C$ be a set of candidate properties, $c \in C$ be
a candidate property with log-parameter $\alpha \in \mathbb{R}$, and $G_c(\lambda) = \max_\alpha G_c(\alpha, \lambda)$ the maximal
estimated gain that property $c$ can give to model $p_{\lambda \cdot \nu}$. Then $c$ is selected in a property selection
step for model $p_{\lambda \cdot \nu}$ if $c = \arg\max_{c' \in C} G_{c'}(\lambda)$.
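A property selection step of this kind might be sketched in Python as follows. The sketch assumes a finite, enumerated sample space, non-negative candidate values that are not identically zero, and precomputed conditional expectations $\tilde p[k_{\lambda \cdot \nu}[c]]$ for each candidate; the names `best_gain` and `select_property` and the dictionary layout are hypothetical. The one-dimensional maximization uses plain Newton steps on the strictly concave gain.

```python
import math

def best_gain(p, c, a, iters=50):
    """Maximize G(alpha) = 1 + alpha*a - sum_x p[x]*exp(alpha*c[x]) over alpha.

    p : model probabilities over an enumerated X (assumption)
    c : candidate property values c(x) >= 0, not all zero
    a : conditional expectation p~[k[c]] of c given the observed data
    Returns (maximal estimated gain, maximizing alpha).
    """
    alpha = 0.0
    for _ in range(iters):
        grad = a - sum(px * cx * math.exp(alpha * cx) for px, cx in zip(p, c))
        hess = -sum(px * cx * cx * math.exp(alpha * cx) for px, cx in zip(p, c))
        alpha -= grad / hess                 # Newton step on a concave function
    gain = 1.0 + alpha * a - sum(px * math.exp(alpha * cx) for px, cx in zip(p, c))
    return gain, alpha

def select_property(p, candidates):
    """candidates: dict name -> (c_values, a). Returns (name, gain, alpha)."""
    scored = {name: best_gain(p, c, a) for name, (c, a) in candidates.items()}
    name = max(scored, key=lambda n: scored[n][0])
    return (name,) + scored[name]
```

Since only the ranking of candidates matters at this stage, the gains need to be computed only to the accuracy required to separate the best candidate from the rest.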

A reasonable stopping criterion for property selection is to employ cross-validation techniques.
That is, the training corpus from $Y$ has to be divided into a training portion and
a held-out portion. Each candidate property is subjected to maximization of the likelihood
for both the training portion and the held-out portion. If the likelihood is increasing for the
training portion, but no longer for the held-out portion, the property is discarded. The idea
is that at such a point overfitting is indicated for a set of properties that too tightly fits the
training portion (and its noise) but no longer provides a good statistical model for both the
training and held-out portion of the training corpus. A similar approach of cross-validation
can be used to provide a stopping criterion in parameter estimation.



4.6.4 Combined Statistical Inference 

The IM procedure for parameter estimation (Definition 4.3) and the procedure for property
selection (Definition 4.5) can be combined into a statistical inference algorithm for log-linear
models from incomplete data as shown in Table 4.3. The initial model of the Combined
Statistical Inference algorithm is assumed to be chosen according to the respective application.
For example, $p_0$ can be chosen as the uniform distribution for finite $X$, or as the estimate resulting
from an application of Baum's maximization technique to CLP (see Sect. 4.4.2) for infinite
$X$. After each property-selection step $t$, a good starting point for parameter estimation is
a $p_0$ based upon parameter value $\hat\alpha + \lambda^{(t)}$, where $\hat\alpha$ is the parameter value of the selected
property $\hat c$ that maximizes the gain $G_{\hat c}(\alpha, \lambda^{(t)})$. Note that $X$ is defined as the disjoint union
of the complete data corresponding to the incomplete data in the random sample, i.e.,
$X := \sum_{y \in Y | \tilde p(y) > 0} X(y)$.

Input      Initial model $p_0$, incomplete-data sample from $Y$.

Output     Log-linear model $p^*$ on complete-data sample $X = \sum_{y \in Y | \tilde p(y) > 0} X(y)$
           with selected property function vector $\nu^*$ and log-parameter vector
           $\lambda^* = \arg\max_{\lambda \in \Lambda} L(\lambda)$, where $\Lambda = \{\lambda \mid p_\lambda$ is a log-linear model
           on $X$ based on $p_0$, $\nu^*$ and $\lambda \in \mathbb{R}^n\}$.

Procedure

   1. Initialize the model with $p_0$ and $C^{(0)} := \emptyset$,

   2. Property selection: For each candidate property $c \in C^{(t)}$, compute
      the gain $G_c(\lambda^{(t)}) := \max_\alpha G_c(\alpha, \lambda^{(t)})$, and select the property
      $\hat c := \arg\max_{c \in C^{(t)}} G_c(\lambda^{(t)})$.

   3. Parameter estimation: Compute a maximum likelihood parameter value
      $\hat\lambda := \arg\max_{\lambda \in \Lambda} L(\lambda)$, where $\Lambda = \{\lambda \mid p_\lambda(x)$ is a log-linear distribution on $X$
      with initial model $p_0$, property function vector $\hat\nu := (\nu_1^{(t)}, \ldots, \nu_n^{(t)}, \hat c)$,
      and $\lambda \in \mathbb{R}^{n+1}\}$.

   4. Until the model converges, set $t := t + 1$, go to 2.

Table 4.3: Algorithm (Combined Statistical Inference)

Let us illustrate this procedure with a simple CLP example. Suppose our sample program
is the same as in Fig. 4.1 but with C-constraints taken from a language of hierarchical types.
The ordering on the types is defined by the operation of set inclusion on the denotations of
the types and depicted graphically in Fig. 4.4.




Figure 4.4: Type hierarchy 



Furthermore, suppose we have a training corpus of ten queries, consisting of three tokens
of query $y_1$: s(Z) & Z = a, four tokens of $y_3$: s(Z) & Z = c, and one token each of query
$y_2$: s(Z) & Z = b, $y_4$: s(Z) & Z = d, and $y_5$: s(Z) & Z = e. The corresponding proof trees
generated by the program in Fig. 4.1 are given in Fig. 4.5. Note that queries $y_1$, $y_2$, $y_3$ and
$y_4$ are unambiguous, being assigned a single proof tree, while $y_5$ is ambiguous.



A useful first distinction between the proof trees of Fig. 4.5 can be obtained by selecting the
two subtrees $\chi_1$ and $\chi_2$ as properties (in Fig. 4.5, $\chi_1$ occurs in the proof trees ending in
Z = a and $\chi_2$ in those ending in Z = b). These properties allow us to cluster
the proof trees in two disjoint sets on the basis of similar statistical qualities of the proof trees
in these sets. Since in our training corpus seven out of ten queries come unambiguously with
a proof tree including property $\chi_1$, we would expect the maximum likelihood parameter value
corresponding to property $\chi_1$ to be higher than the parameter value of property $\chi_2$. However,
we cannot simply recreate the proportions of the training data from the corresponding proof
trees as we did in the unambiguous example of Sect. 4.5. Here we are confronted with an
incomplete-data problem, which means that we do not know the frequency of the possible
proof trees of query $y_5$.

Let us apply the IM algorithm to this incomplete-data problem. For the selected properties
$\chi_1$ and $\chi_2$, we have $\nu_\#(x) = \nu_1(x) + \nu_2(x) = 1$ for all possible proof trees $x$ for the sample of
Fig. 4.5. Thus the parameter updates $\hat\gamma_i$ can be calculated from a particularly simple closed
form, $\hat\gamma_i = \ln (\tilde p[k_\lambda[\nu_i]] / p_\lambda[\nu_i])$. A sequence of IM iterates is given in Table 4.4. Probabilities of proof
trees involving property $\chi_i$ are denoted by $p_i$. Starting from an initial uniform probability of
1/6 for each proof tree, this sequence of likelihood values converges with an accuracy in the
third place after the decimal point after three iterations and yields probabilities $p_1 \approx .259$
and $p_2 \approx .074$ for the respective proof trees.



3 x $y_1$:  s(Z) & Z = a  ->  p(Z) & q(Z) & Z = a  ->  q(Z) & Z = a  ->  Z = a

1 x $y_2$:  s(Z) & Z = b  ->  p(Z) & q(Z) & Z = b  ->  q(Z) & Z = b  ->  Z = b

4 x $y_3$:  s(Z) & Z = c  ->  p(Z) & q(Z) & Z = c  ->  q(Z) & Z = a  ->  Z = a

1 x $y_4$:  s(Z) & Z = d  ->  p(Z) & q(Z) & Z = d  ->  q(Z) & Z = b  ->  Z = b

1 x $y_5$:  s(Z) & Z = e  ->  p(Z) & q(Z) & Z = e  ->  q(Z) & Z = a  ->  Z = a   (first proof tree)
            s(Z) & Z = e  ->  p(Z) & q(Z) & Z = e  ->  q(Z) & Z = b  ->  Z = b   (second proof tree)

Figure 4.5: Queries and proof trees for constraint logic program



Iteration $t$   $\lambda_1^{(t)}$   $\lambda_2^{(t)}$   $p_1^{(t)}$   $p_2^{(t)}$   $L(\lambda^{(t)})$
0               0            0           1/6       1/6       -17.224448
1               ln 1.5       ln .5       .25       .083      -15.772486
2               ln 1.55      ln .45      .2583     .075      -15.753678
3               ln 1.555     ln .445     .25916    .07416    -15.753481

Table 4.4: Estimation using the IM algorithm
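The iterates of Table 4.4 can be retraced with a few lines of Python. The sketch below encodes the six proof trees of Fig. 4.5 by the single property each of them carries, assumes a uniform reference distribution over the trees, and applies the closed-form update with $K = 1$; the data layout and the function names are hypothetical. The log-likelihood is summed over the ten query tokens, which is the scaling under which the values of the table are obtained.

```python
import math

# Six proof trees: one each for y1..y4, two for the ambiguous query y5.
# Each tree carries exactly one property: chi_1 (index 0) or chi_2 (index 1).
TREES = {"y1": [0], "y2": [1], "y3": [0], "y4": [1], "y5": [0, 1]}
COUNTS = {"y1": 3, "y2": 1, "y3": 4, "y4": 1, "y5": 1}
N = sum(COUNTS.values())

def tree_probs(lam):
    """Return [p_1, p_2]: probability of a single tree with chi_1 resp. chi_2."""
    z = sum(math.exp(lam[i]) for props in TREES.values() for i in props)
    return [math.exp(l) / z for l in lam]

def log_likelihood(lam):
    """Incomplete-data log-likelihood, summed over the ten query tokens."""
    p = tree_probs(lam)
    return sum(n * math.log(sum(p[i] for i in TREES[y])) for y, n in COUNTS.items())

def im_step(lam):
    p = tree_probs(lam)
    new = []
    for i in range(2):
        # p~[k_lambda[nu_i]]: expectation of property i given the observed queries
        observed = sum(COUNTS[y] / N * (p[i] / sum(p[j] for j in props))
                       for y, props in TREES.items() if i in props)
        # p_lambda[nu_i]: expectation of property i under the current model
        expected = p[i] * sum(1 for props in TREES.values() if i in props)
        new.append(lam[i] + math.log(observed / expected))  # closed form, K = 1
    return new

lam = [0.0, 0.0]
rows = []
for t in range(4):
    rows.append((t, math.exp(lam[0]), math.exp(lam[1]), *tree_probs(lam),
                 log_likelihood(lam)))
    lam = im_step(lam)
for row in rows:
    print("%d  exp(l1)=%.3f  exp(l2)=%.3f  p1=%.5f  p2=%.5f  L=%.6f" % row)
```

Printing $e^{\lambda_i}$ rather than $\lambda_i$ makes the comparison with the entries ln 1.5, ln 1.55, ln 1.555 of the table immediate.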
4.7 An Experiment

In this section we present an empirical evaluation of the applicability of log-linear probability
models and iterative scaling techniques to constraint-based grammars. We present a computationally
tractable maximum pseudo-likelihood estimation procedure for log-linear models and
apply it to estimating a probabilistic constraint-based grammar from a small corpus of LFG
analyses provided by Xerox PARC. The log-linear models employ a small set of about 200
properties to induce a probability distribution on 3000 parses where on average each sentence
is ambiguous in 10 parses. The empirical evaluation shows that the correct parse from the set
of all parses is found about 59 % of the time.

This section is based on joint work described in Johnson, Geman, Canon, Chi, and Riezler
(1999).



4.7.1 Incomplete-Data Estimation as Maximum Pseudo-Likelihood Estimation for Complete Data



As we saw in Sect. 4.6, the equations to be solved in statistical inference of log-linear models
involve the computation of expectations of property functions $\nu_i(x)$ with respect to $p_\lambda(x)$.
Clearly it is possible to find constraint-based grammars where the sample space $X$ of parses
to be summed over in these expectations is unmanageably large or even infinite.

One possibility to sensibly reduce the summation space is to employ the definition of
the sample space $X := \sum_{y \in Y | \tilde p(y) > 0} X(y)$ used in incomplete-data estimation as a reduction
factor in complete-data estimation. That is, we approximate expectations with respect to
the distribution $p_\lambda(\cdot)$ on $X$ by considering only such parses $x \in X$ whose terminal yield
$y = Y(x)$ is seen in the training corpus. Furthermore, the distribution $g_\lambda(y)$ on terminal
yields is replaced by the empirical distribution $\tilde p(y)$:

$$\begin{aligned}
p_\lambda[\nu_i] &= \sum_{y \in Y} \sum_{x \in X(y)} p_\lambda(x) \nu_i(x) \\
&= \sum_{y \in Y} g_\lambda(y) \sum_{x \in X(y)} k_\lambda(x|y) \nu_i(x) \\
&\approx \sum_{y \in Y} \tilde p(y) \sum_{x \in X(y)} k_\lambda(x|y) \nu_i(x).
\end{aligned}$$

Clearly, for most cases the approximate expectation is easier to calculate since the space
$\sum_{y \in Y | \tilde p(y) > 0} X(y)$ is smaller than the original full space $X$.
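A minimal Python sketch of this approximation is given below. It assumes that each training sentence is represented by the list of property-value vectors of its parses, that the reference distribution is uniform over the parses of a sentence, and that every sentence token carries empirical weight one over the corpus size; the function name `approx_expectations` and the data layout are hypothetical.

```python
import math

def approx_expectations(corpus, lam):
    """Approximate p_lambda[nu_i] by averaging conditional expectations.

    corpus : list of sentences, each a list of parses (property-value vectors)
    lam    : log-parameter vector, one entry per property
    Returns a list with one approximate expectation per property.
    """
    n = len(lam)
    exp_nu = [0.0] * n
    for parses in corpus:
        scores = [sum(l * v for l, v in zip(lam, nu)) for nu in parses]
        m = max(scores)                       # shift for numerical stability
        weights = [math.exp(s - m) for s in scores]
        z = sum(weights)                      # normalizer over the parses X(y)
        for w, nu in zip(weights, parses):
            for i in range(n):
                # k_lambda(x|y) * nu_i(x), weighted by p~(y) = 1 / len(corpus)
                exp_nu[i] += (w / z) * nu[i] / len(corpus)
    return exp_nu
```

The cost of one evaluation is linear in the total number of parses of the training sentences, independent of the size of the full space of parses licensed by the grammar.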
The equations to be solved in complete-data estimation for log-linear models are then

$$\sum_{y \in Y} \tilde p(y) \sum_{x \in X(y)} k_\lambda(x|y) \nu_i(x) = \sum_{x \in X} \tilde p(x) \nu_i(x) \quad \text{for all } i = 1, \ldots, n.$$

The solutions of these equations maximize another criterion, namely a
complete-data log-pseudo-likelihood function $PL_c$ which is defined with respect to the conditional
probability of parses given the yields observed in the training corpus:

$$PL_c(\lambda) = \ln \prod_{x \in X, \, y \in Y} k_\lambda(x|y)^{\tilde p(x, y)}.$$



In the actual implementation described in Johnson, Geman, Canon, Chi, and Riezler
(1999), a slightly different objective function was maximized, which adds a regularization term
promoting small values of $\lambda$. The maximization equations were solved
using a conjugate-gradient approach adapted from Press, Teukolsky, Vetterling, and Flannery
(1992). A similar approach to maximum pseudo-likelihood estimation for log-linear models
from complete data but in the context of an iterative scaling approach can be found in Berger,
Della Pietra, and Della Pietra (1996).



4.7.2 Property Design for Feature-Based CLGs 

One central aim of our experiment was to take advantage of the high flexibility of log-linear
models and to evaluate the usefulness of this flexibility in hard terms of empirical performance.

The properties employed in our models clearly deviate from the rule or production properties
employed in most other probabilistic grammars by encoding as property functions general
linguistic principles as proposed by Alshawi and Carter (1994), Srinivas, Doran, and Kulick
(1995), or Hobbs and Bear (1995). The definition of properties of LFG parses refers to both the
c(onstituent)- and f(eature)-structures of the parses. Examples of the properties employed
in our model are



• properties counting the number of adjuncts, arguments and segments in an analysis, 

• properties corresponding to grammatical functions used in LFG, including SUBJ, OBJ, 
OBJ2, COMP, XCOMP, ADJUNCT, etc. 

• properties measuring the complexity of the phrase being attached to, thus indicating 
both high and low attachment, 

• properties indicating non-right-branching of nonterminal nodes, 

• properties indicating non-parallel coordinate structures, 

• properties for atomic attribute-value pairs in feature structures, 
• properties for particular syntactic structures such as date-NPs,

• standard rule-properties. 

The number of properties defined for each of the two corpora we worked with was about 
200 including about 50 rule-properties respectively. 

We would also have liked to have included properties corresponding to lexical-semantic 
head-head relations, but found the small size of our training corpora to be an obstacle in 
estimating the associated parameters accurately. 



4.7.3 Empirical Evaluation 



The two corpora provided to us by Xerox PARC contain appointment planning dialogs (Verbmobil
corpus, henceforth VM-corpus), and a documentation of Xerox printers (Homecentre
corpus, henceforth HC-corpus). The basic properties of the corpora are summarized in Table 4.5.
The corpora consist of a packed representation of the c- and f-structures of parses
produced for the sentences by an LFG grammar. The LFG parses have been produced automatically
by the XLE system (see Maxwell III and Kaplan (1989)) and additionally corrected manually.
Furthermore, it is indicated for each sentence which of its parses is the linguistically
correct one. The ambiguity of the sentences in the corpus is 10 parses on average.





                                            VM-corpus   HC-corpus
number of sentences                         540         980
number of ambiguous sentences               314         481
number of parses of ambiguous sentences     3245        3169

Table 4.5: Properties of the corpora used for the estimation experiment



In order to cope with the small size of the corpora, a 10-way cross-validation framework was
used for estimation and evaluation. That is, the sentences of each corpus were assigned
randomly to 10 approximately equal-sized subcorpora. In each run, 9 of the subcorpora
served as training corpus, and one subcorpus as test corpus. The evaluation scores presented
in Tables 4.6 and 4.7 are sums over the evaluation scores gathered by using each subcorpus
in turn as test corpus and training on the 9 remaining subcorpora.
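
The splitting scheme just described can be sketched in a few lines of Python. This is only an
illustration under stated assumptions: the function names, the fixed random seed, and the use of
integer sentence identifiers are hypothetical and are not taken from the original experiments.

```python
import random

def cross_validation_folds(sentence_ids, n_folds=10, seed=0):
    """Randomly assign sentence ids to n_folds approximately equal-sized subcorpora."""
    ids = list(sentence_ids)
    random.Random(seed).shuffle(ids)          # seed is an assumption, for repeatability
    return [ids[i::n_folds] for i in range(n_folds)]

def train_test_splits(folds):
    """Yield (train, test) pairs, using each subcorpus once as the test corpus."""
    for i, test in enumerate(folds):
        train = [s for j, fold in enumerate(folds) if j != i for s in fold]
        yield train, test

# 540 is the number of sentences reported for the VM-corpus.
folds = cross_validation_folds(range(540), n_folds=10)
splits = list(train_test_splits(folds))
```

Scores computed on the ten test subcorpora would then be accumulated over the ten runs.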

We used two evaluation measures on the test corpus. The first measure $C_{test}(\lambda)$ gives
the accuracy of disambiguation based on most probable parses. That is, $C_{test}(\lambda)$ counts
the percentage of sentences in the test corpus whose most probable parse according to a model
$p_\lambda$ is the manually determined correct parse. If a sentence has $k$ most probable parses
and one of these parses is the correct one, this sentence gets score $1/k$. The second evaluation
measure is $-PL_{test}(\lambda)$, the negative log-pseudo-likelihood for the correct parses of the
test corpus given their yields. This metric measures how much of the probability mass the model
puts onto the correct analyses.
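
As a rough illustration of the two measures, the following Python sketch computes them for a toy
test corpus. The data layout (one list of parse probabilities plus the index of the correct
parse per sentence) and the function names are assumptions made for this example only.

```python
import math

def disambiguation_accuracy(test_corpus):
    """Fraction of sentences whose most probable parse is correct; ties score 1/k."""
    total = 0.0
    for probs, correct in test_corpus:
        best = max(probs)
        top = [i for i, p in enumerate(probs) if math.isclose(p, best)]
        if correct in top:
            total += 1.0 / len(top)
    return total / len(test_corpus)

def neg_log_pseudo_likelihood(test_corpus):
    """Negative log-probability of the correct parses, each conditioned on its yield."""
    return -sum(math.log(probs[correct] / sum(probs))
                for probs, correct in test_corpus)

# Each entry: (model probabilities of all parses of one sentence, index of correct parse).
toy_corpus = [([0.5, 0.5], 0), ([0.7, 0.2, 0.1], 0), ([0.1, 0.9], 0)]
accuracy = disambiguation_accuracy(toy_corpus)
neg_pl = neg_log_pseudo_likelihood(toy_corpus)
```

In the toy corpus the first sentence has two equally probable parses and contributes 1/2, the
second contributes 1, and the third contributes 0.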

In the empirical evaluation, the maximum pseudo-likelihood estimator is compared against
a baseline estimator which treats all parses as equally likely. Furthermore, another objective
function is considered: The function $C_X(\lambda)$ is the number of times the highest weighted
parse under $\lambda$ is the manually determined correct parse in the training corpus $X$. This
function directly encodes the criterion which is used in the linguistic evaluation. However,
$C_X(\lambda)$ is a highly discontinuous function in $\lambda$ and hard to maximize. Experiments
using a simulated annealing optimization procedure (Press, Teukolsky, Vetterling, and Flannery
1992) for this objective function showed that the computational difficulty of this procedure
grows and the quality of the solutions degrades rapidly with the number of properties employed
in the model.



The results of the empirical evaluation are shown in Tables 4.6 and 4.7. The maximum
pseudo-likelihood estimator outperformed both the simulated annealing estimator and
the uniform baseline estimator on both corpora. The simulated annealing procedure typically
scores better than the maximum pseudo-likelihood approach if the number of properties is
very small. However, the pseudo-likelihood approach already outperforms simulated annealing
for the roughly 200 properties used in our experiment. Furthermore, it should be noted that the
absolute accuracy of about 59 % on the disambiguation task has to be assessed relative
to an average of 10 parses per sentence.





                                      $C_{test}$ for VM-corpus   $-PL_{test}$ for VM-corpus
uniform baseline estimator            9.7 %                      533
simulated annealing estimator         53.7 %                     469
maximum pseudo-likelihood estimator   58.7 %                     396

Table 4.6: Empirical evaluation of estimators on $C_{test}$ (accuracy of disambiguation with most
probable parse) and $-PL_{test}$ (negative log-pseudo-likelihood of correct parses in test corpus)
on VM-corpus



                                      $C_{test}$ for HC-corpus   $-PL_{test}$ for HC-corpus
uniform baseline estimator            15.2 %                     655
simulated annealing estimator         53.2 %                     604
maximum pseudo-likelihood estimator   58.8 %                     583

Table 4.7: Empirical evaluation of estimators on HC-corpus



4.8 Approximation Methods

With the algorithms and proofs of the preceding sections in hand, it seems that statistical
inference of log-linear models from incomplete data reduces to solving simple equations and
computing expectations of simple functions. However, depending on the size of the sample
spaces over which these expectations must be taken and depending on the complexity of the
parameter- and property-space, these equations can become intractable both analytically and
numerically. In order to give a self-contained recipe for statistical inference of log-linear models
from incomplete data, we will discuss the possibilities of applying various approximation
methods to achieve both analytical and computational tractability in complex applications.



4.8.1 Enforcing a Closed-Form Solution 

As mentioned above, if the property-functions sum to a constant independent of $x$, i.e., if

$$\nu_\#(x) = \sum_{i=1}^{n} \nu_i(x) = K \quad \text{for all } x \in \mathcal{X},$$

then the maximum $\gamma$ of the auxiliary function $A$ used in parameter estimation is given in
closed form.

For a given vector of property-functions $\nu$ with $\nu_\#(x) = K$, the IM algorithm can be
stated as shown in Table 4.8. Note that the complete-data sample $\mathcal{X}$ is computed as
$\mathcal{X} = \sum_{y \in \mathcal{Y} | p(y) > 0} X(y)$.

In this case, the IM algorithm can be seen as an incomplete-data version of the generalized
iterative scaling algorithm of Darroch and Ratcliff (1972).



If the constancy-condition is not fulfilled, it can be enforced by introducing a "correction"
property-function $\nu_l$ as follows:

Choose $K = \max_{x \in \mathcal{X}} \nu_\#(x)$ and $\nu_l(x) = K - \nu_\#(x)$ for all $x \in \mathcal{X}$,
then $\sum_{i=1}^{l} \nu_i(x) = K$ for all $x \in \mathcal{X}$.

Unfortunately, defining a correction property can be expensive, e.g., in case a property selec- 
tion procedure is used in statistical inference, a correction property has to be defined after 
each property selection step. 
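
The construction of a correction property can be illustrated with a small Python sketch. The
representation of property-function values as plain lists of counts and the function name are
assumptions made for this example.

```python
def add_correction_property(feature_vectors):
    """Append a correction property so that every vector sums to K = max row sum."""
    K = max(sum(v) for v in feature_vectors)
    corrected = [list(v) + [K - sum(v)] for v in feature_vectors]
    return K, corrected

# Three complete data points with three property-function values each (toy values).
vectors = [[1, 0, 2], [0, 1, 0], [2, 2, 1]]
K, corrected = add_correction_property(vectors)
```

If the set of properties changes, for instance after a property selection step, both K and the
correction values have to be recomputed, which is the cost mentioned above.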

Correction properties can be avoided by letting $\nu_\#$ vary over $x \in \mathcal{X}$. This
approach is also claimed to improve the convergence rate of iterative scaling methods by
increasing the step size taken toward the maximum at each iteration.

Input       Initial model $p_0$, property-functions vector $\nu$, incomplete-data sample from
            $\mathcal{Y}$.
Output      MLE model $p_{\lambda^*}$ on $\mathcal{X} = \sum_{y \in \mathcal{Y} | p(y) > 0} X(y)$.
Procedure
    Until convergence do
        Compute $p_\lambda$, $k_\lambda$, based on $\lambda = (\lambda_1, \ldots, \lambda_n)$,
        For $i$ from 1 to $n$ do
            $\gamma_i := \frac{1}{K} \ln \frac{\sum_{y \in \mathcal{Y}} p(y) \sum_{x \in X(y)} k_\lambda(x|y) \nu_i(x)}{\sum_{x \in \mathcal{X}} p_\lambda(x) \nu_i(x)}$,
            $\lambda_i := \lambda_i + \gamma_i$,
    Return $\lambda^* = (\lambda_1, \ldots, \lambda_n)$.



Table 4.8: Algorithm (Iterative Maximization, Closed-Form)
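
To make the closed-form update concrete, the following Python sketch runs it on a toy problem
with four complete data points and two yields. Everything about the setup is an assumption made
for illustration: the dictionaries, the uniform initial model, the fixed number of iterations in
place of a convergence test, and the toy property values, which sum to K = 2 for every point.

```python
import math

def model(lam, features, p0):
    """Log-linear model: p_lam(x) proportional to p0(x) * exp(lam . nu(x))."""
    w = {x: p0[x] * math.exp(sum(l * f for l, f in zip(lam, nu)))
         for x, nu in features.items()}
    z = sum(w.values())
    return {x: wx / z for x, wx in w.items()}

def im_closed_form(features, yields, p_emp, p0, n_iter=100):
    """Closed-form IM updates; assumes all property vectors sum to the same K."""
    K = sum(next(iter(features.values())))
    assert all(sum(nu) == K for nu in features.values())
    n = len(next(iter(features.values())))
    lam, history = [0.0] * n, []
    for _ in range(n_iter):
        p_lam = model(lam, features, p0)
        g = {y: sum(p_lam[x] for x in xs) for y, xs in yields.items()}
        history.append(sum(p_emp[y] * math.log(g[y]) for y in yields))
        for i in range(n):
            # Numerator: p[k_lam[nu_i]] with k_lam(x|y) = p_lam(x) / g(y).
            num = sum(p_emp[y] * sum(p_lam[x] / g[y] * features[x][i] for x in xs)
                      for y, xs in yields.items())
            den = sum(p_lam[x] * features[x][i] for x in features)  # p_lam[nu_i]
            lam[i] += math.log(num / den) / K
    return lam, history

features = {"a": [2, 0], "b": [1, 1], "c": [0, 2], "d": [1, 1]}
yields = {"y1": ["a", "b"], "y2": ["c", "d"]}
p_emp = {"y1": 0.75, "y2": 0.25}
p0 = {x: 0.25 for x in features}
lam, history = im_closed_form(features, yields, p_emp, p0)
p_final = model(lam, features, p0)
g_y1 = p_final["a"] + p_final["b"]
```

In this toy case the incomplete-data log-likelihood recorded in `history` does not decrease
from one iteration to the next, and the probability assigned to the yield y1 approaches its
empirical frequency.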



4.8.2 Numerical Approximation via Newton's Method 

If $\nu_\#(x)$ does not add up to a constant for all $x \in \mathcal{X}$, the solutions to the
maximization equations in parameter estimation and property selection cannot, in general, be
determined in closed form. Fortunately, numerical methods such as Newton's method can be used to
efficiently compute approximate solutions to these equations.

Newton's method approximates the solution $\alpha$ of an equation $f(\alpha) = 0$ by using a
sequence of linearizations of $f$. At each step, the intersection of the tangent to $f$ at
$\alpha_t$ with the $\alpha$-axis is taken, yielding an improved estimate $\alpha_{t+1}$. The
iteration formulae to approach the solution up to a desired accuracy are defined as follows.

$$\alpha_{t+1} = \alpha_t - \frac{f(\alpha_t)}{f'(\alpha_t)}, \quad \text{where } f'(\alpha_t) \text{ is the derivative of } f \text{ at } \alpha_t.$$
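
A minimal Python sketch of this iteration is given here for orientation. The function name,
the stopping tolerance, and the iteration cap are assumptions; the example equation is chosen
only because its solution is known.

```python
import math

def newton(f, f_prime, alpha0=0.0, tol=1e-10, max_iter=50):
    """Iterate alpha := alpha - f(alpha) / f'(alpha) until the step is small."""
    alpha = alpha0
    for _ in range(max_iter):
        step = f(alpha) / f_prime(alpha)
        alpha -= step
        if abs(step) < tol:
            break
    return alpha

# Example: solve exp(alpha) - 2 = 0, whose solution is ln 2.
root = newton(lambda a: math.exp(a) - 2.0, lambda a: math.exp(a))
```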



This method directly suits our application when we replace $f(\alpha)$ by the first derivative
of the auxiliary function $A$, $\frac{\partial}{\partial \gamma_i} A(\gamma, \lambda)$, in case of
parameter estimation, and by the first derivative of the approximate gain $G_c$,
$\frac{\partial}{\partial \alpha} G_c(\alpha, \lambda)$, in case of property selection. Newton's
method usually converges rapidly for such functions.

To efficiently compute the functions in the Newton formulae, we can use a caching technique
similar to the one used in Abney (1997) and apply it to our incomplete-data problem.
First, we have to define tables of total probabilities as follows.

• $S_{i,v} = \sum_{x \in \mathcal{X}} p_\lambda(x)\, \delta_{\nu_i(x),v}$ is the expected number of
times property function $\nu_i$ takes value $v$,

• $T_{i,y} = \sum_{x \in X(y)} k_\lambda(x|y)\, \nu_i(x)$ is the conditionally expected number of
times property $\chi_i$ occurs,

• $U_{i,m} = \sum_{x \in \mathcal{X} | \nu_\#(x) = m} p_\lambda(x)\, \nu_i(x)$ is the expected number
of times property $\chi_i$ occurs when there is a total number of $m$ property instances.

Corresponding to these expectations, we define the following counting variables: 

• $s_r(\alpha, i) = \sum_v S_{i,v}\, e^{\alpha v}\, v^r$,

• $t_r(\alpha, i) = \sum_y T_{i,y}\, \alpha^r$,

• $u_r(\alpha, i) = \sum_m U_{i,m}\, e^{\alpha m}\, m^r$.

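The Newton formulae for property selection can then be filled with these expected counts as
follows:

$$\alpha_{t+1} = \alpha_t + \frac{\frac{\partial}{\partial \alpha_t} G_c(\alpha_t, \lambda)}{-\frac{\partial^2}{\partial \alpha_t^2} G_c(\alpha_t, \lambda)}
= \alpha_t + \frac{p[k_\lambda[c]] - N p_\lambda[c\, e^{\alpha_t c}]}{N p_\lambda[c^2 e^{\alpha_t c}]}
= \alpha_t + \frac{t_0(\alpha_t, c) - N s_1(\alpha_t, c)}{N s_2(\alpha_t, c)}$$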



The tables of total probabilities defined above also allow us to express the gain $G_c(\alpha, \lambda)$
of adding property $c$ with best parameter value $\alpha$ to model $p_\lambda$ in terms of expected
counts:

$$G_c(\alpha, \lambda) = N + p[k_\lambda[\alpha c]] - N p_\lambda[e^{\alpha c}] = N + t_1(\alpha, c) - N s_0(\alpha, c).$$

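For the task of parameter estimation, similar Newton formulae can be obtained from the
expected counts:

$$\alpha_{t+1} = \alpha_t + \left.\frac{\frac{\partial}{\partial \gamma_i} A(\gamma, \lambda)}{-\frac{\partial^2}{\partial \gamma_i^2} A(\gamma, \lambda)}\right|_{\gamma_i = \alpha_t}
= \alpha_t + \frac{p[k_\lambda[\nu_i]] - N p_\lambda[\nu_i e^{\alpha_t \nu_\#}]}{N p_\lambda[\nu_i \nu_\# e^{\alpha_t \nu_\#}]}
= \alpha_t + \frac{t_0(\alpha_t, i) - N u_0(\alpha_t, i)}{N u_1(\alpha_t, i)}$$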






For a random sample from $\mathcal{Y}$ of size $N$, an algorithm for approximate parameter
estimation can be defined from the above Newton formulae as shown in Table 4.9.

Input       Initial model $p_0$, property-functions vector $\nu$, incomplete-data sample from
            $\mathcal{Y}$.
Output      MLE model $p_{\lambda^*}$ on $\mathcal{X} = \sum_{y \in \mathcal{Y} | p(y) > 0} X(y)$.
Procedure
    Until convergence do
        Compute tables $T$, $U$, based on $\lambda = (\lambda_1, \ldots, \lambda_n)$,
        For $i$ from 1 to $n$ do
            $\alpha := 0$,
            Until $\alpha$ is accurate enough do
                $u_0 := 0$, $u_1 := 0$, $t_0 := 0$,
                For $m$ from 0 to $m_{max}$ do
                    $a := U_{i,m}\, e^{\alpha m}$,
                    $u_0 := u_0 + a$,
                    $u_1 := u_1 + a m$,
                For $y \in \mathcal{Y}$ where $p(y) > 0$ do
                    $b := T_{i,y}$,
                    $t_0 := t_0 + b$,
                $\alpha := \alpha + \frac{t_0 - N u_0}{N u_1}$,
            $\lambda_i := \lambda_i + \alpha$,
    Return $\lambda^* = (\lambda_1, \ldots, \lambda_n)$.



Table 4.9: Algorithm (Iterative Maximization, Newton-Estimate)
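
The inner Newton loop for a single parameter can be sketched in Python as follows. The
dictionary representation of the tables $U$ and $T$, the function name, and the stopping rule
are assumptions made for this illustration; the tables are taken as given rather than computed
from a model.

```python
import math

def newton_parameter_update(U_i, T_i, N, tol=1e-10, max_iter=100):
    """Solve t0 = N * u0(alpha) for alpha by Newton's method.

    U_i maps a total property count m to U_{i,m}; T_i maps a yield y to T_{i,y}.
    """
    alpha = 0.0
    t0 = sum(T_i.values())
    for _ in range(max_iter):
        u0 = sum(U * math.exp(alpha * m) for m, U in U_i.items())
        u1 = sum(U * math.exp(alpha * m) * m for m, U in U_i.items())
        step = (t0 - N * u0) / (N * u1)
        alpha += step
        if abs(step) < tol:
            break
    return alpha

# Toy tables: every data point has m = 2 property instances in total.
alpha = newton_parameter_update(U_i={2: 0.5}, T_i={"y1": 1.0}, N=1)
```

When only a single value $m = K$ occurs, as in this toy call, the Newton iterate converges to the
closed-form value $\frac{1}{K} \ln \frac{t_0}{N U_{i,K}}$.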



Similarly, an algorithm for approximate property selection can be given as in Table 4.10.



Input       Model $p_\lambda$, set of candidate properties $C$, incomplete-data sample from
            $\mathcal{Y}$.
Output      Selected property $c^*$ with maximal parameter value $\alpha^*$.
Procedure
    Compute tables $S$, $T$, based on $\lambda$,
    $G^* := 0$, $c^* := 0$, $\alpha^* := 0$,
    For all candidates $c \in C$ do
        $\alpha := 0$,
        Until $\alpha$ is accurate enough do
            $s_0 := 0$, $s_1 := 0$, $s_2 := 0$, $t_0 := 0$, $t_1 := 0$,
            For $v$ from 0 to $v_{max}$ do
                $a := S_{c,v}\, e^{\alpha v}$,
                $s_0 := s_0 + a$,
                $s_1 := s_1 + a v$,
                $s_2 := s_2 + a v^2$,
            For $y \in \mathcal{Y}$ where $p(y) > 0$ do
                $b := T_{c,y}$,
                $t_0 := t_0 + b$,
                $t_1 := t_1 + b \alpha$,
            $\alpha := \alpha + \frac{t_0 - N s_1}{N s_2}$,
        $G := N + t_1 - N s_0$,
        If $G > G^*$, then $G^* := G$, $c^* := c$, $\alpha^* := \alpha$.
    Return $c^*$, $\alpha^*$.



Table 4.10: Algorithm (Property Selection, Newton-Estimate)
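
The following Python sketch mirrors the selection loop for two toy candidate properties. The
tables are given directly instead of being computed from a model, and the dictionary layout, the
function name, and the fixed number of Newton steps are assumptions made for this example. The
gain is evaluated at the final parameter value.

```python
import math

def select_property(S, T, N, n_newton=50):
    """Return (gain, property, alpha) for the candidate with the largest gain.

    S[c] maps a value v to S_{c,v}; T[c] maps a yield y to T_{c,y}.
    """
    best = (0.0, None, 0.0)
    for c in S:
        alpha = 0.0
        t0 = sum(T[c].values())
        for _ in range(n_newton):
            s1 = sum(w * math.exp(alpha * v) * v for v, w in S[c].items())
            s2 = sum(w * math.exp(alpha * v) * v * v for v, w in S[c].items())
            alpha += (t0 - N * s1) / (N * s2)
        s0 = sum(w * math.exp(alpha * v) for v, w in S[c].items())
        gain = N + alpha * t0 - N * s0          # t1 = alpha * t0
        if gain > best[0]:
            best = (gain, c, alpha)
    return best

# Toy tables for two binary candidate properties and a sample of size N = 10.
S = {"c1": {0: 0.8, 1: 0.2}, "c2": {0: 0.5, 1: 0.5}}
T = {"c1": {"y1": 3.0, "y2": 2.0}, "c2": {"y1": 3.0, "y2": 2.0}}
gain, chosen, alpha = select_property(S, T, N=10)
```

Candidate c2 already matches its observed count under the current model, so its best parameter
value is 0 and its gain is 0; candidate c1 is underestimated by the model and is selected.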



4.8.3 Approximating Expectations via Monte Carlo Methods 

Independent of whether the solutions of the maximization equations exist in closed form, 
a further problem arises in connection with large or infinite sample spaces. That is, if the 
sample space X is too large to be summed over in the calculation of the expectations in the 
maximization equations, methods must be used to approximate these expectations. 



One possibility is to use Monte Carlo methods. Following Abney (1997), we use the
Metropolis-Hastings method and show how it can be applied to our incomplete-data problem.

Input       Initial state $x_0 \in X(y)$,
            Nominating matrix $p' = p_\pi(x)$ on $X(y)$,
            Log-linear distribution $p = p_\lambda(x)$ on $X(y)$,
            Evaluation matrix $\alpha_{x,z} = 1$ if $p(x)p'(z) \le p(z)p'(x)$, and
            $\alpha_{x,z} = \frac{p(z)p'(x)}{p(x)p'(z)}$ otherwise,
            Terminal number of steps $k$.
Output      Random sample $X_0, \ldots, X_k$ from $p_\lambda$ on $X(y)$.
Procedure
    $X_0 := x_0$,
    $i := 1$,
    While $i \le k$
        $x := X_{i-1}$,
        Randomly generate $z$ from $p'$,
        If $z = X_{i-1}$, then $X_i := X_{i-1}$,
        Else evaluate $\alpha_{x,z}$,
            Randomly generate $u$ from uniform distribution on $[0, 1]$,
            If $u \le \alpha_{x,z}$, then $X_i := z$,
            Else $X_i := X_{i-1}$,
        $i := i + 1$,
    Return $X_0, \ldots, X_k$.

Table 4.11: Algorithm (Metropolis-Hastings Sampling)

The strategy behind this method is to generate a random sample from a target distribution
$p$ by choosing a nominating matrix $p'$ from which sampling is easy, and performing a Bernoulli
trial with parameter $\alpha$ to determine whether to accept or reject the nominated sample point.
That is, this method converts a sampler for $p'$ into a sampler for $p$ via an evaluation
matrix $\alpha$. For our application, we can take as nominating matrix for each query
$y \in \mathcal{Y}$ a stochastic context-free CLP model $p_\pi(x)$ on $X(y)$ as defined in
Sect. 4.4.2. Sampling from this stochastic derivation model is easy and can be converted by a
standard evaluation matrix to sampling from the desired log-linear distribution $p_\lambda(x)$
on $X(y)$.



Following standard textbooks such as Fishman (1996), an application of the Metropolis-Hastings
algorithm to our problem is as shown in Table 4.11.



Note that the evaluation matrix $\alpha_{x,z}$ reduces to a particularly simple form for our
application which does not require the computation of normalization constants $Z_\lambda$. That
is, by taking the initial model $p_0$ of the log-linear CLP model $p_\lambda$ to be of the form of
a stochastic CLP model $p_\pi$, and by assuming independence of the nominated sample points, we
get the following form of $\alpha_{x,z}$:

$$\alpha_{x,z} = \min\left(1, \frac{p(z)p'(x)}{p(x)p'(z)}\right), \quad \text{where} \quad
\frac{p(z)p'(x)}{p(x)p'(z)} = \frac{p_\lambda(z)\, p_\pi(x)}{p_\lambda(x)\, p_\pi(z)}
= \frac{Z_\lambda^{-1} e^{\lambda \cdot \nu(z)} p_\pi(z)\, p_\pi(x)}{Z_\lambda^{-1} e^{\lambda \cdot \nu(x)} p_\pi(x)\, p_\pi(z)}
= \frac{e^{\lambda \cdot \nu(z)}}{e^{\lambda \cdot \nu(x)}}.$$
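
The simplified acceptance ratio can be tried out on a toy sample space with the following Python
sketch. The three "proof trees", their nominating probabilities, property values, and parameter
values are invented for illustration, and the function and variable names are assumptions. The
sketch only demonstrates the sampling scheme, not a CLP implementation.

```python
import math
import random

def metropolis_hastings(x0, propose, score, k, rng):
    """Independence sampler: nominate z from p', accept with min(1, e^score(z) / e^score(x))."""
    sample = [x0]
    for _ in range(k):
        x = sample[-1]
        z = propose(rng)
        if z == x:
            sample.append(x)
            continue
        alpha = min(1.0, math.exp(score(z) - score(x)))
        sample.append(z if rng.random() <= alpha else x)
    return sample

trees = ["t1", "t2", "t3"]
p_pi = {"t1": 0.5, "t2": 0.3, "t3": 0.2}          # nominating model p' (toy values)
nu = {"t1": [1, 0], "t2": [0, 1], "t3": [1, 1]}    # property-function values (toy values)
lam = [0.5, -1.0]                                  # parameters lambda (toy values)

def score(x):
    return sum(l * v for l, v in zip(lam, nu[x]))  # lambda . nu(x)

def propose(rng):
    return rng.choices(trees, weights=[p_pi[t] for t in trees])[0]

sample = metropolis_hastings("t1", propose, score, k=20000, rng=random.Random(0))
freq = {t: sample.count(t) / len(sample) for t in trees}

# Exact target for comparison; this normalization is only feasible in the toy case.
weights = {t: p_pi[t] * math.exp(score(t)) for t in trees}
target = {t: w / sum(weights.values()) for t, w in weights.items()}
```

Note that the sampler itself never computes the normalization constant; only the comparison
against the exact target does.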



It can be shown for this sampling method that the i.i.d. random variables $X_i$ converge in
distribution to the target distribution $p_\lambda$ as $i \to \infty$:

$$\lim_{i \to \infty} P(X_i = x) = p_\lambda(x) \quad \text{for all } x \in X(y).$$

Furthermore, a proper random sample from a probability distribution $p$ makes it possible to
estimate expectations of functions $f$ with respect to $p$ directly from the sample points $X_i$.
That is, the estimated expectation converges to the true expectation with probability 1:

$$\lim_{k \to \infty} \frac{1}{k} \sum_{i=1}^{k} f(X_i) = \sum_{x} f(x)\, p(x) \quad \text{with probability 1.}$$



Applying the Metropolis-Hastings algorithm to a log-linear model for CLP yields for each
$y \in \mathcal{Y}$ where $p(y) > 0$ a random sample $X(y)$ from $p_\lambda$ on $X(y)$. Such samples
can be combined into a sample $X = \sum_{y \in \mathcal{Y} | p(y) > 0} X(y)$ from $p_\lambda$ on
$\mathcal{X}$. From these random samples the desired estimates of expectations of functions with
respect to $p_\lambda$ can be computed.

Note that we can use the same random sample for each iteration of Newton's method
to estimate the gain for each candidate property simultaneously. After adding the selected
property to the model, again a single random sample from the extended model can be used to
estimate the MLE values for each parameter in parallel. Suppose we have a random sample
from $\mathcal{Y}$ of size $N$, a complete data sample $X(y)$ of size $M_y$ for $y$, and a combined
complete data sample $X = \sum_{y \in \mathcal{Y} | p(y) > 0} X(y)$ of size $L$. Then we can define
tables similar to the tables of total probabilities used in Sect. 4.8.2 as follows:



• $S_{i,v} = \sum_{x \in X} \delta_{\nu_i(x),v}$ is the number of times property function $\nu_i$
takes value $v$ in combined random sample $X$,

• $T_{i,y} = \sum_{x \in X(y)} \nu_i(x)$ is the number of times property $\chi_i$ occurs in random
sample $X(y)$,

• $U_{i,m} = \sum_{x \in X | \nu_\#(x) = m} \nu_i(x)$ is the number of times property $\chi_i$ occurs
in combined random sample $X$ when there is a total number of $m$ property instances for each
sample point.

For the expectations involved in the closed-form updates in parameter estimation of Sect.
4.8.1, the following counting variable will be convenient:

• $t(i) = \sum_y T_{i,y}\, M_y^{-1}$.

The closed-form parameter update $\gamma$ can then be approximated by random sampling as follows.

$$\gamma_i \approx \frac{1}{K} \ln \frac{t(i)}{\frac{N}{L} \sum_{x \in X} \nu_i(x)}$$



For the expectations involved in the Newton formulae, the counting variables are the same as
those of Sect. 4.8.2 except for

• $t_r(\alpha, i) = \sum_y T_{i,y}\, \alpha^r M_y^{-1}$.

The Newton update used in property selection is approximated by random sampling as follows.

$$\alpha_{t+1} \approx \alpha_t + \frac{t_0(\alpha_t, c) - \frac{N}{L} s_1(\alpha_t, c)}{\frac{N}{L} s_2(\alpha_t, c)}$$

The gain is approximated as

$$G_c(\alpha, \lambda) \approx N + t_1(\alpha, c) - \frac{N}{L} s_0(\alpha, c).$$

Similar random sampling estimates can be obtained for the Newton update used in parameter
estimation:

$$\alpha_{t+1} \approx \alpha_t + \frac{t_0(\alpha_t, i) - \frac{N}{L} u_0(\alpha_t, i)}{\frac{N}{L} u_1(\alpha_t, i)}$$



4.8.4 Approximating Expectations via Maximum Pseudo-Likelihood Estimation

As stated above, Monte Carlo methods offer the theoretical assurance that the approximation
of an expectation converges to the true expectation in the limit. This means that one can get
arbitrarily close to the true value of the expectation with increasing sample size. However,
convergence can be very slow, i.e., the sample size necessary for an appropriate approximation
may be very large. This is especially the case if the distributions of the nominating model
$p_\pi$ and the target model $p_\lambda$ are far apart, which may happen if probabilistic
context-free grammars are used as nominating model for a log-linear model on constraint-based
grammars. Besides the compensation for sampling errors, many samples may have to be generated to
guarantee a reliable estimate of the desired expectations. Together, these problems can make
Monte Carlo approximations infeasible in practice.

An alternative to Monte Carlo methods is to approximate expectations in a maximum
pseudo-likelihood estimation framework. In Sect. 4.3.3 we introduced partial E-steps in the
EM algorithm as an instance of maximum pseudo-likelihood estimation. The idea there was
to replace an intractable probability function with respect to which an expectation is taken
by a probability function which is more tractable. One possibility to achieve such tractable
expectations is to use sparse expectations: Instead of replacing the intractable sample space
by a Monte Carlo sample and counting from this, the original sample space is restricted to
an appropriate finite subset over which the expectation is calculated.



The general form of such sparse approximations is as follows (cf. Neal and Hinton (1998)).
Let $S^{(t)}(y)$ be a finite subset of the set $X(y)$ of complete data corresponding to an
incomplete datum $y \in \mathcal{Y}$. Then a sparse conditional distribution $s_{\lambda^{(t)}}(x|y)$
on complete data $x$ given incomplete data $y$ and the current value of the parameters
$\lambda^{(t)}$ can be defined s.t.

$$s_{\lambda^{(t)}}(x|y) = \begin{cases} 0 & \text{if } x \notin S^{(t)}(y), \\[4pt]
\dfrac{p_{\lambda^{(t)}}(x)}{\sum_{x' \in S^{(t)}(y)} p_{\lambda^{(t)}}(x')} & \text{if } x \in S^{(t)}(y). \end{cases}$$

That is, for a given subset $S^{(t)}(y)$ of the sample space $X(y)$ defined at time $t$, the sparse
probability distribution $s_{\lambda^{(t)}}(x|y)$ is defined as the normalized probability
distribution that assigns a positive probability only to the elements in $S^{(t)}(y)$. The
calculation of expectations $\sum_{x \in S^{(t)}(y)} s_{\lambda^{(t)}}(x|y) f(x)$ of functions $f(x)$
with respect to $s_{\lambda^{(t)}}(x|y)$ then only takes time proportional to the size of
$S^{(t)}(y)$ at time $t$.

Various heuristics can be used for a flexible definition of $S^{(t)}(y)$. A sensible approach is
to define $S^{(t)}(y)$ as the $N$ most probable $x \in X(y)$, and recalculate this set at each step
$t$, and frequently perform a full iteration with $S^{(t)}(y) = X(y)$ for all $y \in \mathcal{Y}$
with $p(y) > 0$. For $N = 1$, this approach yields the well-known Viterbi-approximation of the EM
algorithm. Here each $y$ is assumed to come with a unique $x \in X(y)$ at time $t$. Given
algorithms for efficiently searching for the most probable proof tree $x$ for a given query $y$,
a Viterbi-approximation can be defined for parameter estimation of a probabilistic CLP model. A
recursive use of this algorithm also enables an N-best-approximation.
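
The N-best restriction and the resulting sparse expectation can be illustrated with a short
Python sketch. The probabilities are toy values, the sample space is small enough to enumerate,
and the function names are assumptions; in a realistic setting the N most probable proof trees
would come from a search procedure rather than from sorting an explicit list.

```python
import heapq

def sparse_conditional(p_lambda, n_best):
    """Renormalize p_lambda over the n_best most probable elements of X(y)."""
    support = heapq.nlargest(n_best, p_lambda, key=p_lambda.get)
    z = sum(p_lambda[x] for x in support)
    return {x: p_lambda[x] / z for x in support}

def sparse_expectation(s, f):
    """Expectation of f under the sparse distribution s."""
    return sum(prob * f(x) for x, prob in s.items())

# Toy probabilities for the proof trees of one query y, and toy property counts.
p_lambda = {"x1": 0.5, "x2": 0.3, "x3": 0.15, "x4": 0.05}
counts = {"x1": 2, "x2": 0, "x3": 1, "x4": 3}

two_best = sparse_conditional(p_lambda, n_best=2)
viterbi = sparse_conditional(p_lambda, n_best=1)
expected_count = sparse_expectation(two_best, counts.get)
```

With `n_best=1` all probability mass is placed on the single most probable tree, which
corresponds to the Viterbi-approximation.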

A linguistically motivated definition of $S^{(t)}(y)$ as the trees $x \in X(y)$ of a context-free
grammar which correspond to a bracketing structure annotated to the sample of training
sentences has been presented by Pereira and Schabes (1992). Since the bracketing does not
change during the estimation process, $S^{(t)}(y)$ is constant for all $t$. Clearly, such bracketing
constraints on the one hand yield better linguistic results in terms of the constituent structures
of trees consistent with hand-annotated bracketings. On the other hand, the restriction of the
sample space to the $x \in \mathcal{X}$ which correspond to the bracketing structure of the sample
from $\mathcal{Y}$ also reduces the computational load of the estimation process.

A general form of the IM algorithm using sparse approximations $s_\lambda(x|y)$ is given in Table
4.12.



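Input       Initial model $p_0$, initial set $S^{(0)}(y)$, property-functions vector $\nu$,
            incomplete-data sample from $\mathcal{Y}$.
Output      Approximated MLE model $p_{\lambda^*}$ on $\mathcal{X} = \sum_{y \in \mathcal{Y} | p(y) > 0} X(y)$.
Procedure
    Until convergence do
        Compute $S(y)$, $p_\lambda$, $s_\lambda$, based on $\lambda = (\lambda_1, \ldots, \lambda_n)$,
        For $i$ from 1 to $n$ do
            $\gamma_i := \frac{1}{K} \ln \frac{\sum_{y \in \mathcal{Y}} p(y) \sum_{x \in S(y)} s_\lambda(x|y) \nu_i(x)}{\sum_{x \in \mathcal{X}} p_\lambda(x) \nu_i(x)}$,
            $\lambda_i := \lambda_i + \gamma_i$,
    Return $\lambda^* = (\lambda_1, \ldots, \lambda_n)$.



Table 4.12: Algorithm (Sparse Iterative Maximization, Closed-Form)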



A theoretical justification of such approaches can be given in terms of partial expectations
in the context of the EM algorithm. In Sect. 4.3.5, we saw that the incomplete-data
log-likelihood $L(\lambda) = p[\ln g_\lambda(y)]$ for a given random sample from $\mathcal{Y}$ is
lower bounded by a pseudo-likelihood function $\mathcal{F}(q, \lambda)$ which is a joint function of
the parameters and of the distributions over the unobserved data. The function $q$ can be set to a
tractable sparse approximation $s_{\lambda^{(t)}}$ of $k_{\lambda^{(t)}}$. Thus a sparse distribution
$s_{\lambda^{(t)}}$ yields a lower bound $\mathcal{F}(s_{\lambda^{(t)}}, \lambda^{(t)}) \le L(\lambda^{(t)})$
in the E-step, which is maximized as a function of $\lambda$ in the M-step. As shown by Neal and
Hinton (1998) or Csiszár and Tusnády (1984), even if some iterations may decrease $L$, we are
guaranteed that the pseudo-likelihood $\mathcal{F}$ which bounds $L$ from below is increased or held
constant with every iteration. The interpretation of the IM algorithm as an instance of a GEM
algorithm given in Sect. 4.6.2.2 thus justifies a replacement of $k_{\lambda^{(t)}}$ by a sparse
approximation $s_{\lambda^{(t)}}$ also for an IM algorithm.

However, it has to be kept in mind that for such partial E-steps monotonicity and convergence
of the estimation algorithm have to be proven in terms of the lower bound $\mathcal{F}$ on $L$.
Clearly, convergence can be shown easily for approaches with constant $S^{(t)}(y)$ for all $t$
induced, e.g., by fixed bracketing constraints, but is hard to verify for approaches which let
$S^{(t)}(y)$ vary as a function of $t$ such as Viterbi-approximations. More subtle versions of
pseudo-likelihood approaches to EM include variational approximation methods, where a
parameterized approximating distribution $q$ is used and the parameters are varied to minimize
the Kullback-Leibler distance between $q$ and $k_\lambda$. Minimizing this distance clearly results
in a minimization of the distance between the pseudo-likelihood function $\mathcal{F}$ and the
true likelihood function $L$.

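$$L(\lambda) - \mathcal{F}(q, \lambda)
= p\Big[\ln g_\lambda(\cdot) - \sum_{x \in X(\cdot)} q(x) \{\ln k_\lambda(x|\cdot) + \ln g_\lambda(\cdot) - \ln q(x)\}\Big]
= p\Big[\sum_{x \in X(\cdot)} q(x) \ln \frac{q(x)}{k_\lambda(x|\cdot)}\Big]
= p[D(q \| k_\lambda)]$$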
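
The identity between the gap $L - \mathcal{F}$ and the expected Kullback-Leibler distance can be
checked numerically on a toy model. The following Python sketch does this under stated
assumptions: the model probabilities, the yields, and the approximating distributions $q$ are
invented, and $q$ is stored per yield as a dictionary.

```python
import math

def gap_and_kl(p_emp, p_lam, yields, q):
    """Return (L - F, p[D(q || k_lam)]) for a finite toy model."""
    g = {y: sum(p_lam[x] for x in xs) for y, xs in yields.items()}
    L = sum(p_emp[y] * math.log(g[y]) for y in yields)
    # F uses ln k_lam(x|y) + ln g(y) = ln p_lam(x) for x in X(y).
    F = sum(p_emp[y] * sum(q[y][x] * (math.log(p_lam[x]) - math.log(q[y][x]))
                           for x in xs if q[y][x] > 0)
            for y, xs in yields.items())
    kl = sum(p_emp[y] * sum(q[y][x] * math.log(q[y][x] / (p_lam[x] / g[y]))
                            for x in xs if q[y][x] > 0)
             for y, xs in yields.items())
    return L - F, kl

p_lam = {"a": 0.4, "b": 0.2, "c": 0.3, "d": 0.1}
yields = {"y1": ["a", "b"], "y2": ["c", "d"]}
p_emp = {"y1": 0.5, "y2": 0.5}

# A sparse q (all mass on one tree for y1) and the exact conditional k_lam.
q_sparse = {"y1": {"a": 1.0, "b": 0.0}, "y2": {"c": 0.5, "d": 0.5}}
q_exact = {"y1": {"a": 0.4 / 0.6, "b": 0.2 / 0.6}, "y2": {"c": 0.75, "d": 0.25}}

gap_sparse, kl_sparse = gap_and_kl(p_emp, p_lam, yields, q_sparse)
gap_exact, kl_exact = gap_and_kl(p_emp, p_lam, yields, q_exact)
```

For the exact conditional distribution the gap vanishes, while for the sparse choice it is
strictly positive and equals the expected divergence.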

The parametric models used, e.g., in the context of large-scale neural networks, are models
assuming complete independence of the variables of the network (mean field approximation,
see Parisi (1988)) or models that approximate the probabilistic dependencies of the original
model (structured variational approximation, see Saul and Jordan (1996)). Possible applications
of variational approximation to estimating probabilistic CLGs could follow these lines. A
discussion of such approaches, however, is beyond the scope of this thesis.



4.9 Parsing and Searching 

In the foregoing chapters we discussed the mathematical and algorithmic details of statistical
inference of log-linear models from incomplete data, and experimented with these techniques
on a small set of real-world data of parses of a constraint-based grammar. On this small
scale it was possible to do ambiguity resolution by explicitly listing all parses according to
the induced probability distribution and picking the most probable one as the correct one.
However, for applications on a larger scale an important question is how the structure of
the probability model on parses can be used to guide the search for the most probable parse
efficiently without having to list all parses explicitly. Thus the question is whether the search
techniques standardly used for probabilistic grammars can be re-applied to the log-linear CLP
and CLG models.



We begin our discussion in Sect. 4.9.1 with an application of the tabular parsing method
of Earley deduction (Pereira and Warren 1983) to CLGs. The table of pending derivations
defined in this method will lay the ground for probabilistic search methods for finding most
probable parses. In Sect. 4.9.2 we show that the probabilistic search method of the Viterbi
algorithm (Viterbi (1967), Forney (1973)) standardly used in context-free tabular processing
models finds the most probable parse of a probabilistic CLG model only under certain
restrictions. Since such restrictions may trade off against the search complexity, methods for
sensibly relaxing the restrictions are desirable. A heuristic search algorithm resulting from
such a relaxation is discussed in Sect. 4.9.3.



4.9.1 Earley Deduction for Feature-Based CLGs 



Earley deduction was introduced by Pereira and Warren (1983) as a generalization of
Earley's efficient context-free parsing algorithm (Earley (1970), Aho and Ullman (1972)) to a
tabular parsing algorithm for definite clause grammars. In contrast to backtracking methods,
in tabular parsing methods a table, or chart, of pending subderivations is built up during
derivation. In Earley deduction, subderivations correspond to definite clauses derived from
the grammar axioms and a query. Storing such derivation states for future use as items in a
chart may avoid the redundancy of backtracking methods which leads in the worst case to
an exponential search complexity. Instead, this dynamic-programming technique of storing
solutions to subproblems may reduce the search complexity to be polynomial in input length.

The very basic concepts of an application of Earley deduction to CLP can be given as
follows. Earley deduction works on two sets of definite clauses, the set of program clauses
$\mathcal{P}$ and the set of derived clauses constituting the chart $\mathcal{C}$. An active item of a
context-free Earley parser corresponds here to a definite clause with at least one relational
atom on its righthandside, i.e., to a non-unit clause. Passive items correspond to clauses whose
righthandsides consist only of an $\mathcal{L}$-constraint, i.e., to unit-clauses. A selection
function determines for each non-unit clause its selected $\mathcal{R}(\mathcal{L})$-atom. We adopt
here the standard Prolog selection rule where the first atom on the righthandside of a clause is
selected in each step. The input to the algorithm consists of a set of program clauses
$\mathcal{P}$ and a query $G$. The content of the chart $\mathcal{C}$ initially consists of $G$ and is
continually added to by an exhaustive application of the following two inference rules²¹ (the
rules are to be read as "If there are clauses $c_1$ and $c_2$ and the conditions on these clauses
are satisfied, then add clause $c_3$ to the chart.").



²¹ Prediction is called "instantiation" in Pereira and Warren (1983) and completion corresponds
to their "resolution". In context-free Earley parsing, a distinction between "predictor",
"scanner" and "completer" operations is standardly made. The first operation corresponds to
prediction and the latter two operations are subsumed by the completion operation of the Earley
deduction framework defined below.



Prediction:

    $c_1 = (H_1 \leftarrow B_1) \in \mathcal{C}$
    $c_2 = (H_2 \leftarrow B_2) \in \mathcal{P}$
    --------------------------------------------
    $c_3 = (S \leftarrow B_2' \cup \phi) \in \mathcal{C}$

where $c_1$ is non-unit, $c_2$ is unit or non-unit, $S$ is the selected atom in $B_1$, $\phi$ is
the $\mathcal{L}$-constraint in $B_1$, and there exists a variant $c_2' = (S \leftarrow B_2')$ of
$c_2$ s.t. $V(c_1) \cap V(B_2') \subseteq V(S)$ and the $\mathcal{L}$-constraint $\phi'$ of $c_3$ is
satisfiable.

Completion:

    $c_1 = (H_1 \leftarrow B_1) \in \mathcal{C}$
    $c_2 = (H_2 \leftarrow B_2) \in \mathcal{C}$
    --------------------------------------------
    $c_3 = (H_1 \leftarrow (B_1 \setminus S) \cup B_2') \in \mathcal{C}$

where $c_1$ is non-unit, $c_2$ is unit, $S$ is the selected atom in $B_1$, and there exists a
variant $c_2' = (S \leftarrow B_2')$ of $c_2$ s.t. $V(c_1) \cap V(B_2') \subseteq V(S)$ and the
$\mathcal{L}$-constraint $\phi'$ of $c_3$ is satisfiable.

These rules can be rationalized as follows: The prediction rule proposes for the selected
atom of a clause $c_1$ a possible variant of a program clause $c_2$ with which an
$\xrightarrow{r,c}$-step, i.e., a combined goal-reduction and constraint-solving step, can be
performed. For a unit clause $c_2$, the completion rule then performs a combined
$\xrightarrow{r,c}$-step on the lefthandside atom of $c_2$ and substitutes this selected atom in
clause $c_1$ by the resulting righthandside $\mathcal{L}$-constraint. Both rules collect the
$\mathcal{L}$-constraints of the antecedent clauses, take care of successful constraint
solving, and prevent accidental variable sharing in the consequent clause.
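
For orientation, the interplay of prediction and completion can be seen in the context-free
special case. The following Python sketch is a plain context-free Earley recognizer with the
usual predictor, scanner, and completer operations. It is not an implementation of Earley
deduction for definite clauses with constraints. The grammar encoding, the function name, and
the restriction to grammars without empty right-hand sides are assumptions of this example.

```python
def earley_recognize(grammar, start, tokens):
    """Context-free Earley recognizer; assumes no empty right-hand sides.

    grammar maps a nonterminal to a list of right-hand sides (tuples of symbols);
    symbols that are not keys of grammar are treated as terminals.
    """
    n = len(tokens)
    chart = [set() for _ in range(n + 1)]       # items: (lhs, rhs, dot, origin)
    for rhs in grammar[start]:
        chart[0].add((start, rhs, 0, 0))
    for i in range(n + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                sym = rhs[dot]
                if sym in grammar:               # prediction
                    for prod in grammar[sym]:
                        new = (sym, prod, 0, i)
                        if new not in chart[i]:
                            chart[i].add(new)
                            agenda.append(new)
                elif i < n and tokens[i] == sym:  # scanning
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:                                 # completion
                for l2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        new = (l2, r2, d2 + 1, o2)
                        if new not in chart[i]:
                            chart[i].add(new)
                            agenda.append(new)
    return any(lhs == start and dot == len(rhs) and origin == 0
               for lhs, rhs, dot, origin in chart[n])

# Toy grammar without features or agreement constraints.
grammar = {
    "S": [("N", "V")],
    "NP": [("N", "N")],
    "N": [("Clinton",), ("talks",)],
    "V": [("talks",)],
}
is_sentence = earley_recognize(grammar, "S", ["Clinton", "talks"])
is_noun_phrase = earley_recognize(grammar, "NP", ["Clinton", "talks"])
is_reversed = earley_recognize(grammar, "S", ["talks", "Clinton"])
```

Each chart entry is stored once and reused, which is what avoids the repeated work of a
backtracking parser.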

Clearly, this combination of top-down prediction and bottom-up completion defines a
search rule which can reduce the parsing complexity in comparison to backtracking methods.
However, to make these inference rules a workable algorithm, several issues concerning the
effective applicability of Earley deduction to different purposes have to be addressed. Since
these topics are not of direct relevance for our problem, we refer the reader to the extensive
literature on this subject (see, e.g., Pereira and Shieber (1987), Dörre (1993), Dörre and
Johnson (1995)).

Let us illustrate the basic concepts of Earley deduction with a simple feature-based CLG.
In the following example we will make use of a standard technique for string position indexing,
e.g., the indexed clause

sign(X, 0, 1) ← X = φ.

abbreviates an actual CLP clause

sign(X, Y, Z) ← X = φ & Y = 0 & Z = 1.

where the constants 0 and 1 denote the start and end position of the span of the predicate
in the input string. The string position can be read off for unit clauses from the lefthandside
atom, but for non-unit clauses from the first string position argument of the head atom and
the first string position argument of the leftmost atom in the body. Note that string position
indexing is not mentioned in the definition of the inference rules for Earley deduction. In fact,
this indexing is not necessary for Earley deduction to work. Rather, it is an effective way to
reduce the number of unsuccessful rule applications in an implementation, and will also make
our example more transparent.



Let us return for illustration to the simple example of Fig. 3Ji . An indexed variant of this 
program is given in Fig. 4.6.



1 phrase(X, S_0, S) ← X = (phrase ∧ CAT : s ∧ DTR1:CAT : n ∧ DTR2:CAT : v ∧
  DTR1:AGR : Y ∧ DTR2:AGR : Y ∧ DTR1 : Z_1 ∧ DTR2 : Z_2) & sign(Z_1, S_0, S_1) &
  sign(Z_2, S_1, S).

2 phrase(X, S_0, S) ← X = (phrase ∧ CAT : np ∧ DTR1:CAT : n ∧ DTR2:CAT : n ∧
  DTR1 : Z_1 ∧ DTR2 : Z_2) & sign(Z_1, S_0, S_1) & sign(Z_2, S_1, S).

3 word(X, 0, 1) ← X = (word ∧ CAT : n ∧ PHON : Clinton ∧ AGR : sg).

4 word(X, 1, 2) ← X = (word ∧ CAT : v ∧ PHON : talks ∧ AGR : sg).

5 word(X, 1, 2) ← X = (word ∧ CAT : n ∧ PHON : talks ∧ AGR : pl).

6 sign(X, S_0, S) ← phrase(X, S_0, S).

7 sign(X, S_0, S) ← word(X, S_0, S).

Figure 4.6: Indexed feature-based constraint logic grammar

An application of Earley deduction to parsing the query

sign(X, 0, 2) & X = (sign ∧ DTR1:PHON : Clinton ∧ DTR2:PHON : talks).

denoting the input sentence

₀ Clinton ₁ talks ₂

is given in Figs. 4.7 and 4.8.



A convenient way to illustrate graphically the relation of derived items in a chart to partial
parses of an input sentence is by a chart graph. A chart graph for the sequence of derived
clauses of Figs. 4.7 and 4.8 is given in Fig.

This graph associates the numbers of derived clauses with directed edges which connect
input string position nodes. Edges which point from a node to the node itself are labeled
with the numbers of clauses derived by prediction with non-unit clauses. In the language
of context-free Earley parsing, such edges represent predictions on non-terminal symbols.
Scanning of terminal symbols is represented by edges connecting a node with the next node
on its right. Such edges are labeled with the numbers of clauses derived by prediction with
unit clauses. Completion of non-terminal symbols is represented by edges connecting a node
with a node possibly further on its right. Such edges are labeled with the numbers of clauses
derived by the completion rule.

9 ← sign(X, 0, 2) & X = (sign ∧ DTR1:PHON : Clinton ∧ DTR2:PHON : talks).   (I)

10 sign(X, 0, 2) ← phrase(X, 0, 2) & X = (sign ∧ DTR1:PHON : Clinton ∧
      DTR2:PHON : talks).   (P 9,6)

11 phrase(X, 0, 2) ← sign(Z1, 0, S1) & sign(Z2, S1, 2) & X = (phrase ∧ CAT : s ∧
      DTR1 : word ∧ DTR1:CAT : n ∧ DTR1:PHON : Clinton ∧ DTR1:AGR : Y ∧
      DTR2 : word ∧ DTR2:CAT : v ∧ DTR2:PHON : talks ∧ DTR2:AGR : Y ∧
      DTR1 : Z1 ∧ DTR2 : Z2).   (P 10,1)

12 sign(Z1, 0, S1) ← word(Z1, 0, S1) & X = (phrase ∧ CAT : s ∧
      DTR1 : word ∧ DTR1:CAT : n ∧ DTR1:PHON : Clinton ∧ DTR1:AGR : Y ∧
      DTR2 : word ∧ DTR2:CAT : v ∧ DTR2:PHON : talks ∧ DTR2:AGR : Y ∧
      DTR1 : Z1 ∧ DTR2 : Z2).   (P 11,7)

13 word(Z1, 0, 1) ← X = (phrase ∧ CAT : s ∧
      DTR1 : word ∧ DTR1:CAT : n ∧ DTR1:PHON : Clinton ∧ DTR1:AGR : Y ∧ DTR1:AGR : sg ∧
      DTR2 : word ∧ DTR2:CAT : v ∧ DTR2:PHON : talks ∧ DTR2:AGR : Y ∧ DTR2:AGR : sg ∧
      DTR1 : Z1 ∧ DTR2 : Z2).   (P 12,3)

14 phrase(X, 0, 2) ← sign(Z1, 0, S1) & sign(Z2, S1, 2) & X = (phrase ∧ CAT : np ∧
      DTR1 : word ∧ DTR1:CAT : n ∧ DTR1:PHON : Clinton ∧
      DTR2 : word ∧ DTR2:CAT : n ∧ DTR2:PHON : talks ∧
      DTR1 : Z1 ∧ DTR2 : Z2).   (P 10,2)

15 sign(Z1, 0, S1) ← word(Z1, 0, S1) & X = (phrase ∧ CAT : np ∧
      DTR1 : word ∧ DTR1:CAT : n ∧ DTR1:PHON : Clinton ∧
      DTR2 : word ∧ DTR2:CAT : n ∧ DTR2:PHON : talks ∧
      DTR1 : Z1 ∧ DTR2 : Z2).   (P 14,7)

16 word(Z1, 0, 1) ← X = (phrase ∧ CAT : np ∧
      DTR1 : word ∧ DTR1:CAT : n ∧ DTR1:PHON : Clinton ∧ DTR1:AGR : sg ∧
      DTR2 : word ∧ DTR2:CAT : n ∧ DTR2:PHON : talks ∧
      DTR1 : Z1 ∧ DTR2 : Z2).   (P 15,3)

17 sign(Z1, 0, 1) ← X = (phrase ∧ CAT : s ∧
      DTR1 : word ∧ DTR1:CAT : n ∧ DTR1:PHON : Clinton ∧ DTR1:AGR : Y ∧ DTR1:AGR : sg ∧
      DTR2 : word ∧ DTR2:CAT : v ∧ DTR2:PHON : talks ∧ DTR2:AGR : Y ∧ DTR2:AGR : sg ∧
      DTR1 : Z1 ∧ DTR2 : Z2).   (C 12,13)

18 phrase(X, 0, 2) ← sign(Z2, 1, 2) & X = (phrase ∧ CAT : s ∧
      DTR1 : word ∧ DTR1:CAT : n ∧ DTR1:PHON : Clinton ∧ DTR1:AGR : Y ∧ DTR1:AGR : sg ∧
      DTR2 : word ∧ DTR2:CAT : v ∧ DTR2:PHON : talks ∧ DTR2:AGR : Y ∧ DTR2:AGR : sg ∧
      DTR1 : Z1 ∧ DTR2 : Z2).   (C 11,17)

19 sign(Z1, 0, 1) ← X = (phrase ∧ CAT : np ∧
      DTR1 : word ∧ DTR1:CAT : n ∧ DTR1:PHON : Clinton ∧ DTR1:AGR : sg ∧
      DTR2 : word ∧ DTR2:CAT : n ∧ DTR2:PHON : talks ∧
      DTR1 : Z1 ∧ DTR2 : Z2).   (C 15,16)

20 phrase(X, 0, 2) ← sign(Z2, 1, 2) & X = (phrase ∧ CAT : np ∧
      DTR1 : word ∧ DTR1:CAT : n ∧ DTR1:PHON : Clinton ∧ DTR1:AGR : sg ∧
      DTR2 : word ∧ DTR2:CAT : n ∧ DTR2:PHON : talks ∧
      DTR1 : Z1 ∧ DTR2 : Z2).   (C 14,19)

Figure 4.7: Earley deduction chart

21 sign(Z2, 1, 2) ← word(Z2, 1, 2) & X = (phrase ∧ CAT : s ∧
      DTR1 : word ∧ DTR1:CAT : n ∧ DTR1:PHON : Clinton ∧ DTR1:AGR : Y ∧ DTR1:AGR : sg ∧
      DTR2 : word ∧ DTR2:CAT : v ∧ DTR2:PHON : talks ∧ DTR2:AGR : Y ∧ DTR2:AGR : sg ∧
      DTR1 : Z1 ∧ DTR2 : Z2).   (P 18,7)

22 word(Z2, 1, 2) ← X = (phrase ∧ CAT : s ∧
      DTR1 : word ∧ DTR1:CAT : n ∧ DTR1:PHON : Clinton ∧ DTR1:AGR : Y ∧ DTR1:AGR : sg ∧
      DTR2 : word ∧ DTR2:CAT : v ∧ DTR2:PHON : talks ∧ DTR2:AGR : Y ∧ DTR2:AGR : sg ∧
      DTR1 : Z1 ∧ DTR2 : Z2).   (P 21,4)

23 sign(Z2, 1, 2) ← word(Z2, 1, 2) & X = (phrase ∧ CAT : np ∧
      DTR1 : word ∧ DTR1:CAT : n ∧ DTR1:PHON : Clinton ∧ DTR1:AGR : sg ∧
      DTR2 : word ∧ DTR2:CAT : n ∧ DTR2:PHON : talks ∧
      DTR1 : Z1 ∧ DTR2 : Z2).   (P 20,7)

24 word(Z2, 1, 2) ← X = (phrase ∧ CAT : np ∧
      DTR1 : word ∧ DTR1:CAT : n ∧ DTR1:PHON : Clinton ∧ DTR1:AGR : sg ∧
      DTR2 : word ∧ DTR2:CAT : n ∧ DTR2:PHON : talks ∧ DTR2:AGR : pl ∧
      DTR1 : Z1 ∧ DTR2 : Z2).   (P 23,5)

25 sign(Z2, 1, 2) ← X = (phrase ∧ CAT : s ∧
      DTR1 : word ∧ DTR1:CAT : n ∧ DTR1:PHON : Clinton ∧ DTR1:AGR : Y ∧ DTR1:AGR : sg ∧
      DTR2 : word ∧ DTR2:CAT : v ∧ DTR2:PHON : talks ∧ DTR2:AGR : Y ∧ DTR2:AGR : sg ∧
      DTR1 : Z1 ∧ DTR2 : Z2).   (C 21,22)

26 phrase(X, 0, 2) ← X = (phrase ∧ CAT : s ∧
      DTR1 : word ∧ DTR1:CAT : n ∧ DTR1:PHON : Clinton ∧ DTR1:AGR : Y ∧ DTR1:AGR : sg ∧
      DTR2 : word ∧ DTR2:CAT : v ∧ DTR2:PHON : talks ∧ DTR2:AGR : Y ∧ DTR2:AGR : sg ∧
      DTR1 : Z1 ∧ DTR2 : Z2).   (C 18,25)

27 sign(X, 0, 2) ← X = (phrase ∧ CAT : s ∧
      DTR1 : word ∧ DTR1:CAT : n ∧ DTR1:PHON : Clinton ∧ DTR1:AGR : Y ∧ DTR1:AGR : sg ∧
      DTR2 : word ∧ DTR2:CAT : v ∧ DTR2:PHON : talks ∧ DTR2:AGR : Y ∧ DTR2:AGR : sg ∧
      DTR1 : Z1 ∧ DTR2 : Z2).   (C 10,26)

28 sign(Z2, 1, 2) ← X = (phrase ∧ CAT : np ∧
      DTR1 : word ∧ DTR1:CAT : n ∧ DTR1:PHON : Clinton ∧ DTR1:AGR : sg ∧
      DTR2 : word ∧ DTR2:CAT : n ∧ DTR2:PHON : talks ∧ DTR2:AGR : pl ∧
      DTR1 : Z1 ∧ DTR2 : Z2).   (C 23,24)

29 phrase(X, 0, 2) ← X = (phrase ∧ CAT : np ∧
      DTR1 : word ∧ DTR1:CAT : n ∧ DTR1:PHON : Clinton ∧ DTR1:AGR : sg ∧
      DTR2 : word ∧ DTR2:CAT : n ∧ DTR2:PHON : talks ∧ DTR2:AGR : pl ∧
      DTR1 : Z1 ∧ DTR2 : Z2).   (C 20,28)

30 sign(X, 0, 2) ← X = (phrase ∧ CAT : np ∧
      DTR1 : word ∧ DTR1:CAT : n ∧ DTR1:PHON : Clinton ∧ DTR1:AGR : sg ∧
      DTR2 : word ∧ DTR2:CAT : n ∧ DTR2:PHON : talks ∧ DTR2:AGR : pl ∧
      DTR1 : Z1 ∧ DTR2 : Z2).   (C 10,29)

Figure 4.8: Earley deduction chart, cont.

Figure 4.9: Chart graph

There are two parses of the above input sentence hidden in the Earley deduction chart
of Figs. 4.7 and 4.8. In the chart graph of Fig. 4.9, these two parses are represented by the
upper and lower half of the symmetric graph. From each of the two final completed clauses,
27 and 30, a proof tree representing a parse can be reconstructed using the algorithm of Def.
4.6. This algorithm defines the construction of partial proof trees from completed clauses,
and when applied recursively, permits the construction of proof trees from a given Earley
deduction chart.



Definition 4.6. Let c_k be a completed clause derived from clauses c_i and c_j, let t_i and t_j be
the unique partial proof trees corresponding to c_i and c_j, and define for each predicted clause
(E ← F) the partial proof tree consisting of E with the single daughter F. Then the partial
proof tree t(t_i, t_j) corresponding to c_k is constructed by a case distinction on whether
both c_i, c_j are completed clauses, or one or both of c_i, c_j are predicted clauses.

[Tree displays for the two cases of t(t_i, t_j) are not recoverable from the extracted text.]

The proof tree for completed clause 27, corresponding to the parse [Clinton_N talks_V]_S,
is the proof tree shown earlier and repeated here in Fig. 4.10. The parse [Clinton_N talks_N]_NP
is derived via the proof tree of Fig. 2.9, repeated here in Fig. 4.11, and can be reconstructed
from completed clause 30.
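
The recursive reconstruction can be illustrated with a small sketch. It is a simplification of Def. 4.6, not the definition itself: instead of building partial proof trees, it only collects the numbers of the program clauses of Fig. 4.6 that a completed clause depends on. The dictionary `CHART` and the function `clauses_used` are hypothetical names; the back-pointers are copied from the (P i,j) and (C i,j) annotations of Figs. 4.7 and 4.8.

```python
# Back-pointers read off the annotations of Figs. 4.7 and 4.8.
# ("P", i, j): item predicted from chart item i using program clause j.
# ("C", i, j): item completed from chart items i and j.
CHART = {
    10: ("P", 9, 6), 11: ("P", 10, 1), 12: ("P", 11, 7), 13: ("P", 12, 3),
    14: ("P", 10, 2), 15: ("P", 14, 7), 16: ("P", 15, 3),
    17: ("C", 12, 13), 18: ("C", 11, 17), 19: ("C", 15, 16), 20: ("C", 14, 19),
    21: ("P", 18, 7), 22: ("P", 21, 4), 23: ("P", 20, 7), 24: ("P", 23, 5),
    25: ("C", 21, 22), 26: ("C", 18, 25), 27: ("C", 10, 26),
    28: ("C", 23, 24), 29: ("C", 20, 28), 30: ("C", 10, 29),
}


def clauses_used(item):
    """Return the program clauses used in the derivation of a chart item."""
    kind, left, right = CHART[item]
    if kind == "P":
        # A predicted item contributes the program clause it instantiates.
        return [right]
    # A completed item combines the derivations of both contributing items.
    return clauses_used(left) + clauses_used(right)


s_reading = clauses_used(27)
np_reading = clauses_used(30)
```

Under this reading, the two final clauses differ exactly in the clauses responsible for the two readings: clauses 1 and 4 for clause 27, clauses 2 and 5 for clause 30.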






X = (sign ∧ DTR1:PHON : Clinton ∧ DTR2:PHON : talks)
& sign(X)

X = (sign ∧ DTR1:PHON : Clinton ∧ DTR2:PHON : talks)
& phrase(X)

χ_1:
X = (phrase ∧ CAT : s ∧ DTR1 : word ∧ DTR1:CAT : n
∧ DTR1:PHON : Clinton ∧ DTR1:AGR : Y ∧ DTR2 : word ∧ DTR2:CAT : v
∧ DTR2:PHON : talks ∧ DTR2:AGR : Y ∧ DTR1 : Z1 ∧ DTR2 : Z2)
& sign(Z1) & sign(Z2)

X = (phrase ∧ CAT : s ∧ DTR1 : word ∧ DTR1:CAT : n
∧ DTR1:PHON : Clinton ∧ DTR1:AGR : Y ∧ DTR2 : word ∧ DTR2:CAT : v
∧ DTR2:PHON : talks ∧ DTR2:AGR : Y ∧ DTR1 : Z1 ∧ DTR2 : Z2)
& word(Z1) & sign(Z2)

X = (phrase ∧ CAT : s ∧ DTR1 : word ∧ DTR1:CAT : n
∧ DTR1:PHON : Clinton ∧ DTR1:AGR : Y ∧ DTR1:AGR : sg ∧ DTR2 : word
∧ DTR2:CAT : v ∧ DTR2:PHON : talks ∧ DTR2:AGR : Y ∧ DTR2:AGR : sg
∧ DTR2 : Z2) & sign(Z2)

X = (phrase ∧ CAT : s ∧ DTR1 : word ∧ DTR1:CAT : n
∧ DTR1:PHON : Clinton ∧ DTR1:AGR : Y ∧ DTR1:AGR : sg ∧ DTR2 : word
∧ DTR2:CAT : v ∧ DTR2:PHON : talks ∧ DTR2:AGR : Y ∧ DTR2:AGR : sg
∧ DTR2 : Z2) & word(Z2)

χ_2:
X = (phrase ∧ CAT : s ∧ DTR1 : word ∧ DTR1:CAT : n
∧ DTR1:PHON : Clinton ∧ DTR1:AGR : Y ∧ DTR1:AGR : sg ∧ DTR2 : word
∧ DTR2:CAT : v ∧ DTR2:PHON : talks ∧ DTR2:AGR : Y ∧ DTR2:AGR : sg)


Figure 4.10: Proof tree for [Clinton_N talks_V]_S

X = (sign ∧ DTR1:PHON : Clinton ∧ DTR2:PHON : talks)
& sign(X)

X = (sign ∧ DTR1:PHON : Clinton ∧ DTR2:PHON : talks)
& phrase(X)

χ_3:
X = (phrase ∧ CAT : np ∧ DTR1 : word ∧ DTR1:CAT : n
∧ DTR1:PHON : Clinton ∧ DTR2 : word ∧ DTR2:CAT : n
∧ DTR2:PHON : talks ∧ DTR1 : Z1 ∧ DTR2 : Z2)
& sign(Z1) & sign(Z2)

X = (phrase ∧ CAT : np ∧ DTR1 : word ∧ DTR1:CAT : n
∧ DTR1:PHON : Clinton ∧ DTR2 : word ∧ DTR2:CAT : n
∧ DTR2:PHON : talks ∧ DTR1 : Z1 ∧ DTR2 : Z2)
& word(Z1) & sign(Z2)

X = (phrase ∧ CAT : np ∧ DTR1 : word ∧ DTR1:CAT : n
∧ DTR1:PHON : Clinton ∧ DTR1:AGR : sg ∧ DTR2 : word
∧ DTR2:CAT : n ∧ DTR2:PHON : talks ∧ DTR2 : Z2)
& sign(Z2)

X = (phrase ∧ CAT : np ∧ DTR1 : word ∧ DTR1:CAT : n
∧ DTR1:PHON : Clinton ∧ DTR1:AGR : sg ∧ DTR2 : word
∧ DTR2:CAT : n ∧ DTR2:PHON : talks ∧ DTR2 : Z2)
& word(Z2)

χ_4:
X = (phrase ∧ CAT : np ∧ DTR1 : word ∧ DTR1:CAT : n
∧ DTR1:PHON : Clinton ∧ DTR1:AGR : sg ∧ DTR2 : word
∧ DTR2:CAT : n ∧ DTR2:PHON : talks ∧ DTR2:AGR : pl)

Figure 4.11: Proof tree for [Clinton_N talks_N]_NP

4.9.2 Probabilistic CLGs and the Viterbi Algorithm



Turning to probabilistic CLGs, we see that because CLGs are simply instances of CLP, all
techniques developed for statistical inference of probabilistic CLP apply to probabilistic CLGs
without modification. The simplest way to inspect a probability distribution on parses is to
list the respective proof trees and calculate their probabilities from the subtree-properties
and the corresponding parameters. A conceivable probability model for the proof trees of
Figs. 4.10 and 4.11 could take as properties the subtrees introduced by the clauses which
are responsible for the two different readings of the input sentence, namely clauses 1, 4, 2,
and 5. The respective properties χ_1, χ_2, χ_3 and χ_4 are depicted in Figs. 4.10 and 4.11 as
framed parts of the proof trees. MLE from a large natural language corpus for the parameters
λ_1, λ_2, λ_3 and λ_4 corresponding to these properties would probably return a higher weight for
parameters λ_1 and λ_2 than for λ_3 and λ_4. Thus this probability model would rank the proof
tree of Fig. 4.10, corresponding to the parse [Clinton_N talks_V]_S, as more probable given
the input sentence Clinton talks than the proof tree of Fig. 4.11, corresponding to the parse
[Clinton_N talks_N]_NP.
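
As a minimal sketch of this calculation, the following code scores the two proof trees under a log-linear model and normalizes over the parses of the sentence. The parameter values in `LAMBDA` are hypothetical (the text only suggests that λ_1 and λ_2 would exceed λ_3 and λ_4), and a uniform reference distribution is assumed, so that it cancels in the normalization.

```python
import math

# Hypothetical parameter values, chosen only so that lambda_1, lambda_2 > lambda_3, lambda_4.
LAMBDA = {"chi1": 1.2, "chi2": 0.9, "chi3": 0.3, "chi4": 0.1}

# Property counts: which framed subtrees occur in each proof tree.
TREES = {
    "S": {"chi1": 1, "chi2": 1},   # Fig. 4.10, subtrees introduced by clauses 1 and 4
    "NP": {"chi3": 1, "chi4": 1},  # Fig. 4.11, subtrees introduced by clauses 2 and 5
}


def weight(counts):
    """Unnormalized log-linear weight exp(sum_i lambda_i * count_i)."""
    return math.exp(sum(LAMBDA[prop] * n for prop, n in counts.items()))


def parse_distribution(trees):
    """Probability of each parse given the sentence (uniform reference distribution assumed)."""
    weights = {name: weight(counts) for name, counts in trees.items()}
    z = sum(weights.values())
    return {name: w / z for name, w in weights.items()}


probs = parse_distribution(TREES)
```

With these assumed values the sentence reading receives the larger share of the probability mass; other parameter values would change the numbers but not the computation.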

However, if we are interested in the most probable parse of a sentence, listing all possible
parses may be too costly in general, even if the parses just have to be extracted from a chart.
It would clearly be preferable to make use of the structure of the probabilistic model
to guide the search for the most probable parse. The Viterbi algorithm (Viterbi (1967),
Forney (1973)) for finding the most probable parse implements this idea using a dynamic-
programming approach as follows: During derivation, each derivation state must keep track
of the most probable path of derivation states leading towards it. When the final derivation
state is reached, the maximum probability derivation can be recovered by tracing back the
stored path of most probable derivation states.
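
The dynamic-programming idea can be sketched on a generic graph of derivation states. This is an illustration of the Viterbi principle only, not of any particular parser; the state names, the edge probabilities and the function `viterbi` are hypothetical, and the states are assumed to be listed in topological order with the goal reachable from the start.

```python
# Hypothetical derivation states in topological order and transition probabilities.
STATES = ["s", "a", "b", "g"]
EDGES = {("s", "a"): 0.6, ("s", "b"): 0.4, ("a", "g"): 0.5, ("b", "g"): 0.9}


def viterbi(states, edges, start, goal):
    """Return the probability and state sequence of the most probable path."""
    best = {start: 1.0}  # best probability of reaching each state
    back = {}            # back-pointer to the best predecessor of each state
    for v in states:
        for (u, w), p in edges.items():
            if w == v and u in best and best[u] * p > best.get(v, 0.0):
                best[v] = best[u] * p
                back[v] = u
    # Trace the stored back-pointers from the goal to the start.
    path = [goal]
    while path[-1] != start:
        path.append(back[path[-1]])
    return best[goal], path[::-1]


prob, path = viterbi(STATES, EDGES, "s", "g")
```

In this toy graph the locally better first step (to `a`) does not lie on the best path, which is why each state stores its own best predecessor rather than committing to a choice early.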

Clearly, different specifications of this algorithm depend on the chosen parsing strategy
and on the underlying probability model. For example, Stolcke (1993) computes a Viterbi
parse for probabilistic context-free grammars in a framework of probabilistic Earley parsing
as follows: During derivation, each completed item keeps track of the most probable path
of items contributing to it. The rule probabilities are propagated recursively by associating
each predicted item with the probability of the rewriting rule used in the prediction, and
by recording for each completed item the product of probabilities of the pair of items that
contributes with maximal value to the completion. Storing at each completion step the
item-pair leading to the maximum finally yields a path of most probable items from which the
most probable derivation can be retrieved.

Under certain restrictions on the parsing strategy and on the probabilistic search method,
the idea of the Viterbi algorithm is applicable to Earley deduction for log-linear probabilistic
CLGs as follows.

Concerning the parsing strategy, let us strictly adhere to the definition of Earley deduction
given above. That is, we only speak of an ambiguous derivation of a completed clause if more
than one pair of clauses yields via completion the same clause with the same variable binding.
In this setting, a numerical comparison at a completion step is done only between
clause-pairs contributing via completion to the same "instantiated" clause.

Concerning the probabilistic search method, Stolcke (1993)'s model of Viterbi parsing can
be reconstructed if we identify the properties of the log-linear model with program clauses.
If properties are allowed to be subtrees of proof trees, things are more complicated. In this
setting, in order to compare alternatives numerically, we have to build up partial proof trees
incrementally and check their properties during derivation.

First, we have to define a function w to calculate the weight of a partial proof tree t_k
under a log-linear probability model p_λ.

Definition 4.7. Let C be an Earley deduction chart for query G and program P, let X be
the set of proof trees for G from P, and let p_λ be a log-linear distribution on X. Then the
weight w of a partial proof tree t_k constructable for a completed clause c_k ∈ C is defined s.t.

\[ w(t_k) = e^{\lambda \cdot \chi(t_k)}. \]



Furthermore, a numerical comparison between alternatives leading to the same completion
requires the partial proof trees corresponding to the alternative completions to include only
completely built-up subtree-properties. This is necessary to avoid the outranking of highly
weighted partial proof trees by lower weighted partial proof trees at a completion step where
the highly weighted subtree-properties cannot yet be taken into consideration. For an
appropriate partial ordering on trees based on a relation ⊆, we can ensure that partial proof
trees include only completely built-up properties as follows.

A partial proof tree t_k is complete for a property-vector χ = (χ_1, ..., χ_n) iff for
each i = 1, ..., n: χ_i ⊆ t_k or else χ_i ∩ t_k = ∅.
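
A minimal sketch of this completeness test follows, under the assumption that trees and subtree-properties are both encoded as sets of (mother, daughter) edges, so that the ordering on trees becomes set inclusion. The encoding, the example edges and the name `is_complete` are illustrative assumptions, not part of the definition.

```python
def is_complete(tree_edges, properties):
    """True iff each property is fully contained in the tree or disjoint from it."""
    return all(chi <= tree_edges or not (chi & tree_edges) for chi in properties)


# Hypothetical property spanning two edges of a proof tree.
CHI = frozenset({("A", "B"), ("B", "C")})

PARTIAL = frozenset({("A", "B")})             # contains only part of CHI
FULL = frozenset({("A", "B"), ("B", "C")})    # contains all of CHI
UNRELATED = frozenset({("D", "E")})           # shares no edge with CHI

results = [is_complete(t, [CHI]) for t in (PARTIAL, FULL, UNRELATED)]
```

The partial tree is rejected because it has started, but not finished, building the property; weighing it at this point would ignore a weight that a later completion step might still contribute.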



The algorithm of Def. 4.6 can be used for a recursive comparison as follows. Note that we
use the definition of variant given earlier for the specification of an equivalence class of
clauses to be compared.

For each equivalence class [c_k] of completed clauses, record the partial proof tree
t*_k = argmax w(t_k), where [c_k] = {c ∈ C | c is a variant of c_k, and there exist clauses
c_i and c_j in C from which c is derivable via completion}, and t_k ∈ {t(t*_i, t*_j) | t*_i and
t*_j are the highest weighted complete partial proof trees corresponding to clauses
c_i and c_j}.

Figure 4.12: Type hierarchy



Clearly, given the above restrictions, this procedure will yield the most probable proof tree 
for a given query to a program. The possible savings in computational complexity induced by 
this procedure depend on the size of the subtree-properties to be worked out during the search 
process. That is, small subtrees will permit an efficient pruning at nearly each completion step 
whereas subtrees connecting nodes over long distances may in the worst case yield no gain in 
efficiency at all. 



4.9.3 Heuristic Searching for Most Probable Parses 

However, the effective applicability of the search procedure stated above depends strongly on
the form of the grammars under consideration. That is, for particular CLGs, it is inefficient
to restrict the numerical selection to alternative completions which lead to the same
clause with the same variable binding. The storing of variable bindings in each step of an
Earley deduction procedure is necessary to enable partial proofs to be reused in other partial
proofs. Unfortunately, deriving a new clause with each new variable binding may introduce
overhead which in the worst case causes an exponential search cost. This can be the case, e.g.,
for grammars which encode parses entirely via variable bindings, i.e., via ℒ-constraints, and
not via predicates, i.e., ℛ(ℒ)-atoms. The extreme ends of this spectrum
can be marked, e.g., for the first case by CLGs resulting from a direct application of the
compilation procedure of Götz (1995). This procedure translates HPSG descriptions into the
ℒ-constraints of a CLP fragment using a single ℛ(ℒ)-atom for processing. An example for
the second case are definite clause grammars such as those presented in Pereira and Warren
(1983), which encode each grammar symbol as a distinct CLP predicate. For cases like the
first, it would be more effective if one could compare alternative completions leading to a
variant of a CLP clause irrespective of the variable bindings. Unfortunately, this approach
to comparing "uninstantiated" completed clauses introduces a context-dependence problem
caused by incompatible variable bindings. That is, we are confronted here with a trade-off
between efficiency and correctness of the search method.

Let us illustrate this context-dependence problem with a simple example. For illustration
we use the program of Fig. 4.1, repeated here in Fig. 4.13, with ℒ-constraints from a language
of hierarchical types. The ordering on the types is depicted in Fig. 4.12.

An Earley deduction chart for the query s(Z) & Z = e is given in Fig. 4.14.

1 s(Z) ← p(Z) & q(Z).

2 p(Z) ← Z = a.

3 p(Z) ← Z = b.

4 p(Z) ← Z = f.

5 q(Z) ← Z = a.

6 q(Z) ← Z = b.

Figure 4.13: Constraint logic program

7 ← s(Z) & Z = e.   (I)

8 s(Z) ← p(Z) & q(Z) & Z = e.   (P 7,1)

9 p(Z) ← Z = a.   (P 8,2)

10 p(Z) ← Z = b.   (P 8,3)

11 p(Z) ← Z = f.   (P 8,4)

12 s(Z) ← q(Z) & Z = a.   (C 8,9)

13 s(Z) ← q(Z) & Z = b.   (C 8,10)

14 s(Z) ← q(Z) & Z = f.   (C 8,11)

15 q(Z) ← Z = a.   (P 12,5)

16 q(Z) ← Z = b.   (P 13,6)

17 s(Z) ← Z = a.   (C 12,15)

18 s(Z) ← Z = b.   (C 13,16)

Figure 4.14: Earley deduction chart



Let the properties χ_1 to χ_5 of a probability distribution over the proof trees corresponding
to the Earley deduction chart of Fig. 4.14 be defined as the framed subtrees shown in Fig.
4.15. Furthermore, let the corresponding parameter values be λ_1 = ln 2, λ_2 = ln 3, λ_3 =
ln 5, λ_4 = ln 5 and λ_5 = ln 3. Now let us take a look at how the probability model defined by
these properties and parameters guides the search for the most probable proof tree of query
s(Z) & Z = e from the program of Fig. 4.13. The first decision to be made is between the
completed clauses 12, 13 and 14, which differ only by their variable bindings, i.e., by their
ℒ-constraints. The partial proof trees corresponding to these clauses, t_12, t_13 and t_14, are
shown in Fig. 4.15. However, t_14, the highest weighted of these partial proof trees, is not
included in any proof tree, t_17 or t_18, corresponding to the final completion steps. That is, in
this case the probabilistic search has not only missed the most probable proof tree but has
led to failure! Even if we ignore the clauses contributing to this failure, namely clauses 4, 11
and 14, a problem still remains. In this case, w(t_13) = 3 > w(t_12) = 2, and the weight of proof
tree t_18 including the best partial proof tree t_13 is w(t_18) = 9. However, the weight of proof tree
t_17 including the partial proof tree t_12, which we have just thrown away, is w(t_17) = 10, and
w(t_17) > w(t_18). Thus in this case the probabilistic search method has led us to the lower
weighted proof tree.
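
The arithmetic of this example can be checked with a short sketch. The parameter values are those given above and the property sets are read off Fig. 4.15; the dictionaries `PROPS` and `EXTENDS` and the functions `greedy` and `exhaustive` are hypothetical names for the two search behaviours being contrasted.

```python
import math

# Parameter values lambda_1 .. lambda_5 as given in the text.
LAMBDA = {1: math.log(2), 2: math.log(3), 3: math.log(5), 4: math.log(5), 5: math.log(3)}

# Properties contained in each (partial) proof tree of Fig. 4.15.
PROPS = {"t12": {1}, "t13": {2}, "t14": {3}, "t17": {1, 4}, "t18": {2, 5}}

# The full proof tree extending each partial proof tree; t14 has no extension.
EXTENDS = {"t12": "t17", "t13": "t18", "t14": None}


def w(tree):
    """Weight of a tree: exp of the summed parameters of its properties."""
    return math.exp(sum(LAMBDA[i] for i in PROPS[tree]))


def greedy(candidates):
    """Keep only the highest weighted partial proof tree, then extend it."""
    return EXTENDS[max(candidates, key=w)]


def exhaustive():
    """Compare the full proof trees directly."""
    return max((t for t in EXTENDS.values() if t is not None), key=w)


failure = greedy(["t12", "t13", "t14"])   # t14 wins locally but cannot be extended
suboptimal = greedy(["t12", "t13"])       # without t14, t13 wins and leads to t18
optimal = exhaustive()                    # t17, with weight 10
```

Both problems described above show up: the first call returns no proof tree at all, and the second returns the tree of weight 9 rather than the tree of weight 10.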

t_12:
s(Z) & Z = e
p(Z) & q(Z) & Z = e
χ_1: q(Z) & Z = a

t_13:
s(Z) & Z = e
p(Z) & q(Z) & Z = e
χ_2: q(Z) & Z = b

t_14:
s(Z) & Z = e
p(Z) & q(Z) & Z = e
χ_3: q(Z) & Z = f

t_17:
s(Z) & Z = e
p(Z) & q(Z) & Z = e
χ_1: q(Z) & Z = a
χ_4: Z = a

t_18:
s(Z) & Z = e
p(Z) & q(Z) & Z = e
χ_2: q(Z) & Z = b
χ_5: Z = b

Figure 4.15: Partial proof trees

Clearly, the first of these problems can be solved by constructing the Earley deduction
chart in advance and by checking the terminal ℒ-constraint of each partial proof tree
corresponding to an alternative completion against the ℒ-constraints of the final completion steps
in the chart. This can be accomplished by the following re-definition of the equivalence
class [c_k] of completed clauses which is subject to a numerical comparison in a completion
step. Note that we refer here to the definition of variable renaming given in Chap. 2.

Let C be an Earley deduction chart for query G and program P, let c_k = (A ←
B_1 & ... & B_n & ψ) be a clause in C, and let c'_k = c_k \ ψ. Then an equivalence class [c_k]
of completed clauses in C is defined s.t.

[c_k] = {c ∈ C | there exist clauses c_i and c_j in C from which clause c = (C ←
D_1 & ... & D_m & φ) is derivable via completion, c' = c \ φ is obtained from
c'_k by simultaneously replacing each occurrence of a variable X in c'_k by a
renamed variable, for all variables X ∈ V(c'_k), for a renaming ρ, and
there exists a satisfiable ℒ-constraint φ & ϕ for at least one final completed
clause in C with ℒ-constraint ϕ}.

However, the latter of the problems stated above is solvable only at the cost of
re-introducing the restriction to compare only "instantiated" variants of completed clauses, i.e.,
variants of clauses with the same ℒ-constraints. Each search algorithm which allows
"uninstantiated" variants of completed clauses to be compared necessarily provides only a heuristic
search procedure in the sense that it does not guarantee that the most probable proof tree
is found. Depending on the definition of [c_k], the equivalence class of completed clauses which
are compared in a completion step, and on the satisfaction of the completeness requirement on
partial proof trees, the algorithm of Table 4.13 defines either an approximate heuristic or a
true best-parse search algorithm.
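
For the exact case, the search can be sketched as a memoized recursion over a packed chart. The sketch makes two simplifying assumptions that are not part of the algorithm above: properties are identified with program clauses (the setting in which Stolcke's model is reconstructed), and chart items are keyed by their "instantiated" form, so that all alternatives stored under one key are genuinely comparable. The chart, the clause weights and the function `best` are hypothetical.

```python
from functools import lru_cache

# Hypothetical weights e^{lambda_i} attached to program clauses 1..4.
CLAUSE_WEIGHT = {1: 2.0, 2: 3.0, 3: 5.0, 4: 1.0}

# Hypothetical packed chart: each item lists its alternative derivations.
# ("P", c): predicted with program clause c; ("C", i, j): completed from items i and j.
CHART = {
    "a": [("P", 1)],
    "b": [("P", 2)],
    "c": [("P", 3)],
    "d": [("P", 4)],
    "ab": [("C", "a", "b")],
    "cd": [("C", "c", "d")],
    "goal": [("C", "ab", "d"), ("C", "a", "cd")],  # two ways to derive the same item
}


@lru_cache(maxsize=None)
def best(item):
    """Return (weight, clauses) of the highest weighted derivation of a chart item."""
    candidates = []
    for deriv in CHART[item]:
        if deriv[0] == "P":
            clause = deriv[1]
            candidates.append((CLAUSE_WEIGHT[clause], (clause,)))
        else:
            (w_left, used_left), (w_right, used_right) = best(deriv[1]), best(deriv[2])
            candidates.append((w_left * w_right, used_left + used_right))
    # Only the best alternative per item is kept; the others are pruned.
    return max(candidates)


best_weight, best_clauses = best("goal")
```

Keying the chart by "uninstantiated" clauses instead would merge more alternatives under one key and prune more aggressively, at the price of the context-dependence problem illustrated above.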

An approach to a Viterbi-like heuristic search procedure similar to ours is used by Carroll
and Briscoe (1992) for searching the parse forest produced by their probabilistic LR parser for
unification-based grammars. There, too, the parse forest must be built up completely before
unpacking to ensure that the search algorithm pursues successful derivations.

On the whole, the decision between a possibly more efficient, but only approximate heuristic
procedure and a possibly inefficient, but optimal Viterbi algorithm has to be made with
particular classes of CLGs in mind. Furthermore, if a heuristic search procedure is
used, an alternative to completing the chart in advance is to use a backtracking procedure in
connection with an incremental computation of clauses and corresponding best partial proof
trees.



4.10 Summary and Discussion 

In this chapter we presented a probabilistic model for CLP and a novel method for statisti- 
cal inference about the parameters of such models from incomplete data. We discussed the 
problems of previous approaches which applied Baum's estimation technique for stochastic 
context-free models to estimation of stochastic constraint-based models. We showed with a 
counterexample that this incomplete-data estimation method does not generally yield the 
desired maximum likelihood values when applied to constraint-based systems. To overcome 
the inherent context-dependence problem of such systems, we introduced a powerful log- 
linear probability model for CLP. Furthermore, we presented a new algorithm to infer the 
parameters of log-linear models, and also the properties of such parametric models, from in- 
complete training data. We showed monotonicity and convergence of the algorithm to the 
desired maximum likelihood estimates and applied it experimentally to estimation of a CLG 
on a small scale. Furthermore, we discussed various methods for approximate computation of 
the formulae involved in the inference task, and presented methods which use the structure 
of the probabilistic model to guide the search for the most probable analysis. To this end we
presented an approximate heuristic search algorithm based on dynamic programming tech- 
niques. Depending on the class of grammars under consideration, this algorithm can provide 
a considerable efficiency gain in searching for the most probable analysis. 

Input   Log-linear model p_λ on set X of proof trees for goal G from program P,
        weight function w, tree-constructor function t, Earley deduction algorithm,
        choice of equivalence class of completed clauses.

Output  Best proof tree t*_k for G from P.

Procedure

Until no clauses can be added
    Compute clauses by Earley deduction algorithm,
    If c_k is a completed clause,
    Then w* := 0, compute [c_k],
        If [c_k] = [c_l] for some l < k,
        Then t*_k := t*_l,
        Else for each c ∈ [c_k],
            For each c_i, c_j which derive c via completion,
                Compute the best proof tree t*_i for c_i,
                Compute the best proof tree t*_j for c_j,
                If w(t_k) > w*,
                Then w* := w(t_k), t*_k := t_k,
                Else delete c,
Return t*_k.

Table 4.13: Algorithm (Best-Parse Search)

In comparison to the work on quantitative CLP presented above, the advantage of
probabilistic CLP is clearly the possibility to use automatic techniques of statistical inference for
parameter estimation and property selection. Moreover, our incomplete-data inference algorithm
is general enough to be applicable to log-linear probability distributions in general, and thus
is useful in other incomplete-data settings as well. In this chapter the algorithm has been
shown to be useful especially for probabilistic context-sensitive NLP models. In contrast to related
approaches such as those of Magerman (1994), Ratnaparkhi (1998) or Goodman (1998), which
require fully annotated corpora for estimation, our statistical inference algorithm provides
a general means for automatic and reusable training of arbitrary probabilistic constraint-based
grammars from unannotated corpora. Furthermore, our approach is the first one since the
introduction of log-linear models into the discussion of probabilistic parsing by Abney (1991)
which evaluates experimentally the usefulness of general log-linear models on CLGs.



Chapter 5 



Conclusion 



In this final chapter we present a short summary of the work of this thesis. We compare the
advantages and shortcomings of the two presented approaches to quantitative and probabilistic
CLP relative to each other and relative to other approaches. Not surprisingly, the presented
work is not definitive but raises several questions which could not be answered in the course 
of this thesis. These questions will be dealt with when we discuss future continuations of the 
presented work. 



5.1 Summary 

In this thesis, we have presented new mathematical and algorithmic techniques for quantita- 
tive and statistical inference in constraint-based NLP. We have chosen the general concepts of 
CLP as the formal framework to deal with constraint-based NLP, yielding CLGs as instances 
of CLP. Aiming at a general solution of the problem of structural ambiguity in CLGs, we 
have presented two independent approaches to weighted CLGs. 

The first approach, called quantitative CLP, is situated in a clear logical framework, and
presents a sound and complete system of quantitative inference for definite clauses with
subjective weights attached to them. This approach permits weights to be specified in arbitrary ways,
e.g., as subjective probabilities, user-defined preference values, or degrees of grammaticality,
and search techniques such as alpha-beta pruning to be used for efficiently finding the maximally
weighted proof tree for a given set of queries. Related previous work either focussed solely
on formal semantics of quantitative logic programs without specific applications in mind,
or presented only informal attachments of weights to grammar components for the aim of
weight-based pruning in natural language parsing. Our approach is the first one to combine
weight-based parsing for constraint-based systems with a rigorous formal semantics for such
quantitative inference systems.

The second approach, called probabilistic CLP, addresses the problem of structural ambiguity
resolution by a completely different form of weighted CLGs. Here a log-linear probability
model is presented which defines a probability distribution over the proof trees of a constraint
logic program on the basis of weights assigned to arbitrary properties of these trees. The
possibility to define arbitrary features of proof trees as such properties and to estimate
appropriate weights for them permits the probabilistic modeling of arbitrary context-dependencies.
In this thesis we evaluate empirically, for the first time, the applicability and feasibility of
estimation of general log-linear models on CLGs. In contrast to previous approaches, which were
restricted to estimation from annotated data for specialized probabilistic parsing models, we
present an algorithm to estimate the parameters and to induce the properties of log-linear models
from incomplete, unanalyzed data. The new algorithm has the same computational complexity as
related complete-data inference algorithms for log-linear models. Furthermore, we address
the problem of computational intractability of large summations in the inference task by
discussing various techniques to approximately solve this task and present an approximate
heuristic search algorithm for CLGs.



5.2 Future Work 



As shown in Sect. 4.7, the empirical evaluation of estimating log-linear models on CLGs 
showed promising results both for training and evaluation on a small scale. Clearly, the main 
task of future work is a thorough investigation of the performance of the presented general 
algorithms on larger scales of real-world NLP applications. In larger experiments issues which
were addressed so far only theoretically shall be evaluated in practice. Such issues are the 
empirical evaluation of property selection, the evaluation of various approximation methods 
in parameter estimation, or the empirical testing of the performance of non-heuristic versus 
heuristic search techniques in terms of linguistic results. 

New issues which shall be addressed in larger experiments are the use of dynamic pro- 
gramming techniques not only for searching for best parses but also for efficient calculation of 
expectations in the estimation process. Similar to the heuristic Viterbi algorithm presented
for best parse search, the application of dynamic programming to computing expectations will
be possible only in a heuristic way. Clearly, the question to be addressed is how such heuristic 
estimation procedures perform in terms of linguistic evaluations. 

Another issue that will become important for larger data sets is the use of reference
distributions as simpler and easier-to-estimate back-off models. A reasonable choice of a
reference distribution for our task is, e.g., a model defining a probability distribution on
lexical-semantic head-head relations such as verb-noun pairs (see, e.g., Rooth, Riezler, Prescher,
Carroll, and Beil (1999)). Such a clustering model does not require complex parsing models
or costly annotated corpora, but can be estimated easily from large corpora of verb-noun
pairs. Furthermore, such a class-based model will also provide a smooth default distribution
and thus help to solve the sparse data problem.

A further task of future work will be the investigation of possible applications of log-linear
models and incomplete-data estimation to NLP applications different from parsing.